MÖVE: A Holistic LLM Benchmark for the German Public Sector

2026-06-11 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, quick

Summary

MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren, "evaluating models for public administration") is a new holistic benchmark that evaluates 39 large language models (LLMs) for use in the German public sector. It addresses gaps in existing evaluations, which are often English-centric, US-focused, and limited to task performance. MÖVE assesses models along two dimensions. Performance covers summarization, question answering, and topic extraction. Governance covers hallucination tendencies, energy consumption, provider transparency, and alignment with German constitutional values and political party positions. The benchmark uses ten German-language datasets, including custom gold- and silver-standard sets for public administration, and a multi-metric strategy that combines classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. Initial results indicate that no single model excels across all criteria and that model size is an unreliable predictor of quality. The benchmark is under active development, and its results are publicly available.

Key takeaway

If you are an AI Scientist or Machine Learning Engineer evaluating LLMs for German public sector applications, move beyond English-centric, performance-only benchmarks. MÖVE adds governance criteria such as hallucination tendencies, energy consumption, provider transparency, and alignment with German constitutional values. Weighing these alongside task performance helps you select models that perform well and also fit local governance requirements, rather than relying on generic model rankings.

Key insights

MÖVE provides a holistic, German-centric LLM benchmark evaluating both performance and governance for public sector applications.

Principles

LLM selection requires multi-dimensional evaluation beyond task performance.
Model size is not a reliable predictor of LLM quality.
Context-specific benchmarks are crucial for public sector LLM adoption.
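
One way to make the first two principles concrete is a Pareto comparison: a model is only ruled out if another model is at least as good on every criterion and strictly better on one. The sketch below is an illustration, not something the source describes MÖVE as doing. The model names and scores are invented placeholders (not MÖVE results), and every criterion is assumed to be scaled so that higher is better.

```python
def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
    """True if score vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: dict[str, tuple[float, ...]]) -> list[str]:
    """Return the names of models that no other model dominates, sorted by name."""
    return sorted(
        name
        for name, vec in scores.items()
        if not any(
            dominates(other_vec, vec)
            for other, other_vec in scores.items()
            if other != name
        )
    )


# Hypothetical scores: (task performance, hallucination robustness, energy efficiency).
# These numbers are made up for illustration only.
example_scores = {
    "model_a": (0.8, 0.4, 0.6),
    "model_b": (0.6, 0.9, 0.5),
    "model_c": (0.5, 0.3, 0.4),
}

front = pareto_front(example_scores)
print(front)
```

In this toy data, model_c is dominated by model_a, while model_a and model_b each win on a different criterion, so neither can be called better without deciding how much each criterion matters for the application.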

Method

MÖVE evaluates 39 LLMs using ten German datasets, combining classical NLP metrics, embedding-based methods, and LLM-as-a-judge for performance and governance criteria.
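
To show what scoring one output with more than one metric can look like, here is a minimal sketch using only the Python standard library. It pairs a classical overlap metric (a ROUGE-1-style unigram F1) with a cosine similarity over word-count vectors. The word-count vectors are a simple stand-in for real sentence embeddings, and the function names are hypothetical. The source does not specify MÖVE's actual metrics or implementation, and the LLM-as-a-judge component is left out.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercased word tokens; \w is Unicode-aware, so umlauts and ß are kept.
    return re.findall(r"\w+", text.lower())


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 based on clipped unigram overlap."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def count_vectors(a: str, b: str) -> tuple[list[float], list[float]]:
    # Stand-in for an embedding model: word counts over the shared vocabulary.
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    vocab = sorted(set(ca) | set(cb))
    return [float(ca[w]) for w in vocab], [float(cb[w]) for w in vocab]


def score_summary(candidate: str, reference: str) -> dict[str, float]:
    """Report both metrics side by side instead of collapsing them into one number."""
    u, v = count_vectors(candidate, reference)
    return {
        "unigram_f1": unigram_f1(candidate, reference),
        "cosine": cosine_similarity(u, v),
    }


scores = score_summary("Der Antrag wurde abgelehnt", "Der Antrag wurde genehmigt")
print(scores)
```

The example pair shares three of four words yet states the opposite outcome, which is the kind of case where overlap-based metrics look favourable and a semantic or judge-based check is needed.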

In practice

Consult MÖVE results for German public sector LLM selection.
Consider governance criteria like hallucination and energy use.
Develop custom datasets for domain-specific LLM evaluation.

Topics

LLM Benchmarking
German Public Sector
Model Governance
Large Language Models
Hallucination Detection
Energy Efficiency

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.