DOLOMITES: Domain-Specific Long-Form Methodical Tasks

2025-12-25 · Source: Transactions of the Association for Computational Linguistics · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

DoLoMiTes is a new benchmark for evaluating language models on domain-specific, long-form methodical writing tasks. Published in TACL Vol. 13 (2025), it comprises specifications for 519 tasks elicited from hundreds of experts across 25 fields. In such tasks, for example a clinician writing a differential diagnosis or a teacher writing a lesson plan, an expert follows a set procedure to turn specific inputs into structured, long-form output. The benchmark also includes 1,857 concrete input and output examples, obtained by having experts revise up to 10 model-generated examples per task. Initial evaluations show that current language models struggle to automate these tasks, which require complex inferences and the integration of both the provided context and specialized domain knowledge. The dataset is publicly available.

Key takeaway

For AI scientists and research scientists developing or evaluating large language models, DoLoMiTes offers a benchmark for assessing performance on complex, domain-specific, long-form writing tasks. Adding it to your evaluation pipeline can expose where models fall short in generating structured, methodical outputs that depend on contextual understanding and domain knowledge, and can point research toward more reliable, expert-level writing assistants.

Key insights

DoLoMiTes is a new benchmark for evaluating language models on complex, domain-specific, long-form methodical writing tasks.

Principles

Methodical tasks require structured, long-form output.
Automating these tasks demands complex inference and domain knowledge.

Method

DoLoMiTes was built in two stages. First, 519 task specifications were elicited from hundreds of experts across 25 fields. Second, 1,857 concrete input/output examples were collected by having experts revise up to 10 model-generated examples per task.
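To make the two kinds of benchmark records concrete, here is a minimal sketch of how a task specification and a task example might be represented. The class and field names (`TaskSpec`, `TaskExample`, `objective`, `procedure`, and so on) are assumptions for illustration only and are not the dataset's actual schema; only the counts come from the summary above.

```python
from dataclasses import dataclass


@dataclass
class TaskSpec:
    """Hypothetical container for one expert-elicited task specification."""
    domain: str                 # one of the 25 fields, e.g. "Medicine" (illustrative label)
    objective: str              # what the expert is trying to produce
    procedure: str              # the method the expert follows
    output_sections: list[str]  # named parts of the structured long-form output


@dataclass
class TaskExample:
    """Hypothetical container for one concrete input/output example."""
    task: TaskSpec
    input_text: str
    output_text: str  # expert-revised version of a model-generated output


# Counts reported in the summary.
NUM_TASKS = 519
NUM_EXAMPLES = 1857
MAX_EXAMPLES_PER_TASK = 10

# Average number of examples per task implied by those counts.
avg_examples_per_task = NUM_EXAMPLES / NUM_TASKS
print(f"{avg_examples_per_task:.2f} examples per task on average")
```

The reported counts imply an average of roughly 3.6 examples per task, well under the per-task cap of 10.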

In practice

Use DoLoMiTes to benchmark long-form generation capabilities.
Focus model development on complex inference and the integration of domain knowledge.
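As a rough illustration of the first point, the sketch below assembles a prompt from a task description and runs a crude structural check on a response. This is not the evaluation method used in the paper: `build_prompt`, `section_coverage`, the section names, and the sample input are all hypothetical, and a string-match check says nothing about factual or procedural quality.

```python
def build_prompt(objective: str, procedure: str, task_input: dict[str, str],
                 output_sections: list[str]) -> str:
    """Assemble a plain-text prompt from a task description and a filled-in input."""
    input_block = "\n".join(f"{name}: {value}" for name, value in task_input.items())
    section_block = "\n".join(f"- {name}" for name in output_sections)
    return (
        f"Task objective: {objective}\n"
        f"Procedure: {procedure}\n\n"
        f"Input:\n{input_block}\n\n"
        f"Write the output with these sections:\n{section_block}\n"
    )


def section_coverage(output_text: str, output_sections: list[str]) -> float:
    """Return the fraction of required section names that appear in the output."""
    if not output_sections:
        return 0.0
    lowered = output_text.lower()
    found = sum(1 for name in output_sections if name.lower() in lowered)
    return found / len(output_sections)


# Hypothetical lesson-plan task; section names are invented for this sketch.
sections = ["Learning objectives", "Activities", "Assessment"]
prompt = build_prompt(
    objective="Write a lesson plan",
    procedure="State objectives, then plan activities, then plan assessment.",
    task_input={"Topic": "Photosynthesis", "Grade level": "7"},
    output_sections=sections,
)

# Stand-in for a model response that omits the "Assessment" section.
draft = "Learning objectives: ...\nActivities: ..."
coverage = section_coverage(draft, sections)
print(f"Section coverage: {coverage:.2f}")
```

A check like this only flags missing structure. Judging whether the content reflects sound inference and domain knowledge still needs expert or model-based review.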

Topics

DoLoMiTes Benchmark
Long-Form Text Generation
Domain-Specific Tasks
Language Model Evaluation
Methodical Writing

Best for: AI Scientist, Research Scientist, AI Researcher, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Transactions of the Association for Computational Linguistics.