MÖVE: A Holistic LLM Benchmark for the German Public Sector
Summary
MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren, "evaluating models for public administration") is a holistic benchmark for the German public sector that evaluates 39 large language models (LLMs). It addresses gaps in existing evaluations, which are often English-centric, US-focused, and limited to task performance. MÖVE assesses models along two dimensions. Performance covers summarization, question answering, and topic extraction. Governance covers hallucination tendencies, energy consumption, provider transparency, and alignment with German constitutional values and political party positions. Using ten German-language datasets, including custom gold- and silver-standard sets for public administration, MÖVE applies a multi-metric strategy that combines classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. Initial results indicate that no single model excels across all criteria and that model size is an unreliable predictor of quality. The benchmark is under active development, and its results are publicly available.
Key takeaway
If you are an AI scientist or machine learning engineer evaluating LLMs for German public sector applications, move beyond English-centric, performance-only benchmarks. Use MÖVE's evaluation, which adds governance criteria such as hallucination tendencies, energy consumption, and alignment with German constitutional values. This helps you select models that are not only performant but also ethically sound and compliant with local requirements, instead of relying on generic model rankings.
Key insights
MÖVE provides a holistic, German-centric LLM benchmark evaluating both performance and governance for public sector applications.
Principles
- LLM selection requires multi-dimensional evaluation beyond task performance.
- Model size is not a reliable predictor of LLM quality.
- Context-specific benchmarks are crucial for public sector LLM adoption.
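One way to act on the first principle is to compare models on several criteria at once and keep only those that no other model beats everywhere. The sketch below computes such a Pareto front. It is an illustration, not MÖVE's own ranking procedure: the model names, criterion names, and scores are made-up placeholders, and every criterion is assumed to be scaled so that higher is better.

```python
def dominates(a: dict[str, float], b: dict[str, float]) -> bool:
    """True if `a` is at least as good as `b` on every criterion
    and strictly better on at least one (higher is better)."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def pareto_front(models: dict[str, dict[str, float]]) -> list[str]:
    """Return the names of models that no other model dominates."""
    return sorted(
        name
        for name, scores in models.items()
        if not any(
            dominates(other, scores)
            for other_name, other in models.items()
            if other_name != name
        )
    )


# Hypothetical scores, not taken from the MÖVE results.
example_models = {
    "model_a": {"performance": 0.9, "faithfulness": 0.6, "energy_efficiency": 0.3},
    "model_b": {"performance": 0.7, "faithfulness": 0.8, "energy_efficiency": 0.8},
    "model_c": {"performance": 0.6, "faithfulness": 0.7, "energy_efficiency": 0.5},
}

print(pareto_front(example_models))  # ['model_a', 'model_b']
```

Here model_c drops out because model_b is at least as good on every criterion, while model_a and model_b both remain: each wins on something the other loses, which is the "no single best model" situation the summary describes.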
Method
MÖVE evaluates 39 LLMs on ten German-language datasets, scoring performance and governance criteria with a combination of classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches.
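To make the multi-metric idea concrete, here is a minimal sketch that scores one model output with three kinds of signal. It does not reproduce MÖVE's metrics: `unigram_f1` is a simple stand-in for a classical overlap metric (similar in spirit to ROUGE-1), the embedding vectors are assumed to come from some sentence-embedding model that is not named here, and `judge_score` is assumed to be a rating already obtained from a separate LLM judge.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercased whitespace split; a real setup would use a German-aware tokenizer.
    return text.lower().split()


def unigram_f1(candidate: str, reference: str) -> float:
    """Classical overlap metric: F1 over shared unigrams."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Embedding-based metric: cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_output(
    candidate: str,
    reference: str,
    candidate_vec: list[float],  # hypothetical precomputed embedding
    reference_vec: list[float],  # hypothetical precomputed embedding
    judge_score: float,          # hypothetical rating from an LLM judge
) -> dict[str, float]:
    """Report the three signals side by side instead of collapsing them."""
    return {
        "unigram_f1": unigram_f1(candidate, reference),
        "embedding_cosine": cosine_similarity(candidate_vec, reference_vec),
        "judge": judge_score,
    }
```

Keeping the three scores separate makes disagreements visible, for example a summary with low word overlap that an embedding metric or a judge still rates as faithful.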
In practice
- Consult MÖVE results for German public sector LLM selection.
- Weigh governance criteria such as hallucination tendencies and energy consumption alongside task performance.
- Develop custom datasets for domain-specific LLM evaluation.
Topics
- LLM Benchmarking
- German Public Sector
- Model Governance
- Large Language Models
- Hallucination Detection
- Energy Efficiency
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.