MÖVE: A Holistic LLM Benchmark for the German Public Sector

Source: Computation and Language · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Advanced, quick

Summary

MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren, roughly "evaluating models for public administration") is a holistic benchmark for large language models (LLMs) in the German public sector; it evaluates 39 models. It addresses gaps in existing evaluations, which are often English-centric, US-focused, and concerned only with task performance. MÖVE assesses models along two dimensions. Performance covers summarization, question answering, and topic extraction. Governance covers hallucination tendencies, energy consumption, provider transparency, and alignment with German constitutional values and political party positions. The benchmark uses ten German-language datasets, including custom gold- and silver-standard sets for public administration, and a multi-metric strategy that combines classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. Initial results indicate that no single model excels across all criteria and that model size is an unreliable predictor of quality. The benchmark is under active development, and its results are publicly available.

Key takeaway

AI scientists and machine learning engineers evaluating LLMs for German public sector applications should look beyond English-centric, performance-only benchmarks. MÖVE's evaluation adds governance criteria such as hallucination tendencies, energy consumption, provider transparency, and alignment with German constitutional values. Weighing these criteria alongside task performance helps in selecting models that fit local requirements instead of relying on generic model rankings.
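
One way to act on this when no single model wins everywhere is to keep only the models that are not beaten on every criterion by some other model (a Pareto front). The sketch below is an illustration, not part of MÖVE: the model names, the three criteria, and all scores are invented placeholders, and every criterion is assumed to be rescaled so that higher is better (for example, energy efficiency instead of raw energy consumption).

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Return the names of models that no other model dominates, sorted alphabetically."""
    front = []
    for name, own in scores.items():
        beaten = any(
            dominates(other, own)
            for other_name, other in scores.items()
            if other_name != name
        )
        if not beaten:
            front.append(name)
    return sorted(front)


# Hypothetical scores: (task quality, faithfulness, energy efficiency), higher is better.
toy_scores = {
    "model_a": (0.80, 0.60, 0.40),
    "model_b": (0.70, 0.75, 0.70),
    "model_c": (0.65, 0.70, 0.60),
}

print(pareto_front(toy_scores))  # ['model_a', 'model_b']
```

Here model_c drops out because model_b is at least as good on all three placeholder criteria, while model_a and model_b each lead on a different one, so the final choice between them depends on which criteria matter most for the application.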

Key insights

MÖVE provides a holistic, German-centric LLM benchmark evaluating both performance and governance for public sector applications.

Method

MÖVE evaluates 39 LLMs on ten German-language datasets, combining classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches to score its performance and governance criteria.
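
To make the multi-metric idea concrete, the sketch below scores one model output against a reference with three kinds of signal: a lexical overlap score (a ROUGE-1-style unigram F1), a cosine similarity over embeddings, and a judge score. It is a minimal illustration under stated assumptions, not MÖVE's implementation: `embed` and `judge` are hypothetical callables the caller must supply (for instance, a sentence-embedding model and an LLM prompt that returns a value in [0, 1]), and the unweighted mean is an arbitrary aggregation chosen for simplicity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens; \w is Unicode-aware in Python 3, so umlauts are kept.
    return re.findall(r"\w+", text.lower())


def unigram_f1(candidate, reference):
    """ROUGE-1-style F1 over unigram overlap (a classical lexical metric)."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_output(candidate, reference, embed, judge):
    """Collect the three signals and their unweighted mean (an assumed aggregation).

    embed: hypothetical callable mapping text to a list of floats.
    judge: hypothetical callable mapping (candidate, reference) to a float in [0, 1].
    """
    scores = {
        "lexical": unigram_f1(candidate, reference),
        "embedding": cosine_similarity(embed(candidate), embed(reference)),
        "judge": judge(candidate, reference),
    }
    scores["mean"] = sum(scores.values()) / len(scores)
    return scores
```

Keeping the three signals separate in the returned dictionary, rather than reporting only the mean, makes it visible when they disagree, for example when a paraphrased summary scores low on lexical overlap but high on embedding similarity.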

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.