DOLOMITES: Domain-Specific Long-Form Methodical Tasks
Summary
DoLoMiTes is a benchmark for evaluating language models on domain-specific, long-form methodical writing tasks. Published in TACL Vol. 13 (2025), it comprises specifications for 519 distinct tasks gathered from hundreds of experts across 25 fields. Each task, such as a clinician writing a differential diagnosis or a teacher creating a lesson plan, requires producing structured, long-form output from a given input by following a set procedure. The benchmark also includes 1,857 concrete input and output examples, obtained by having experts revise up to 10 model-generated examples per task. Initial evaluations show that current language models struggle to automate these tasks, which demand complex inferences and the integration of both the provided context and specialized domain knowledge. The dataset is publicly available.
Key takeaway
For AI Scientists and Research Scientists developing or evaluating large language models, DoLoMiTes offers a benchmark for assessing performance on complex, domain-specific, long-form writing tasks. Adding it to an evaluation pipeline can expose where current models fall short in generating structured, methodical outputs that depend on contextual understanding and domain knowledge, and can help direct research toward more robust, expert-level AI writing assistants.
Key insights
DoLoMiTes is a new benchmark for evaluating language models on complex, domain-specific, long-form methodical writing tasks.
Principles
- Methodical tasks require structured, long-form output.
- Automating these tasks demands complex inference and domain knowledge.
Method
DoLoMiTes was developed by eliciting 519 task specifications from experts across 25 fields, then collecting 1,857 concrete input/output examples via expert revisions of model-generated content.
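To put the counts above in proportion, the short calculation below works out the averages they imply. It uses only the figures quoted in this summary; the variable names are illustrative, and the averages say nothing about how examples are actually distributed across tasks or fields.

```python
# Figures quoted in the summary above.
NUM_TASKS = 519
NUM_EXAMPLES = 1857
NUM_FIELDS = 25
MAX_MODEL_EXAMPLES_PER_TASK = 10  # upper bound on model-generated examples per task

# Simple averages; the real distribution across tasks and fields may be uneven.
examples_per_task = NUM_EXAMPLES / NUM_TASKS
tasks_per_field = NUM_TASKS / NUM_FIELDS

print(f"Average examples per task: {examples_per_task:.2f}")  # 3.58
print(f"Average tasks per field:   {tasks_per_field:.2f}")    # 20.76
```

On average, then, each task has fewer than four examples, well under the cap of 10 model-generated examples per task.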
In practice
- Use DoLoMiTes to benchmark long-form generation capabilities.
- Focus model development on complex inference and domain integration.
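As a rough illustration of the first point, the sketch below shows one way a structural check could be wired into an evaluation loop. Everything in it is an assumption made for illustration: the `Example` record layout, the field names, the prompt template, and the `section_coverage` score are hypothetical and are not the dataset's actual schema or the paper's evaluation metric. It only tests whether a generated output mentions the section titles an expert-written output would contain.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """Hypothetical record layout; the real dataset's field names may differ."""
    task_objective: str
    input_text: str
    output_sections: List[str]  # section titles expected in the output


def section_coverage(generated: str, expected_sections: List[str]) -> float:
    """Fraction of expected section titles that appear in the generated text."""
    if not expected_sections:
        return 1.0
    text = generated.lower()
    hits = sum(1 for title in expected_sections if title.lower() in text)
    return hits / len(expected_sections)


def evaluate(examples: List[Example], generate: Callable[[str], str]) -> float:
    """Mean section coverage of `generate` over all examples (0.0 if none)."""
    scores = []
    for ex in examples:
        # Assumed prompt template, not taken from the paper.
        prompt = f"Task: {ex.task_objective}\n\nInput:\n{ex.input_text}\n\nOutput:"
        scores.append(section_coverage(generate(prompt), ex.output_sections))
    return sum(scores) / len(scores) if scores else 0.0
```

Here `generate` stands in for any function that maps a prompt string to model output. A check like this captures only surface structure; judging whether the content is correct still requires domain expertise or a stronger metric.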
Topics
- DoLoMiTes Benchmark
- Long-Form Text Generation
- Domain-Specific Tasks
- Language Model Evaluation
- Methodical Writing
Best for: AI Scientist, Research Scientist, AI Researcher, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Transactions of the Association for Computational Linguistics.