Community Evals: Because we're done trusting black-box leaderboards over the community

2025-12-05 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, short

Summary

Hugging Face launched "Community Evals" on February 4, 2026, a system that decentralizes the reporting of AI model evaluations and makes it more transparent. Benchmark datasets on the Hugging Face Hub can now host leaderboards that automatically aggregate evaluation scores from model repositories. Models store their evaluation results in `.eval_results/*.yaml` files, which are displayed on model cards and fed into benchmark leaderboards. Anyone in the community can submit evaluation results for any model via pull request; these appear as "community" scores and do not require the model author's approval. Benchmarks define their evaluation specifications in an `eval.yaml` file, based on the Inspect AI format, so that results can be reproduced. The system aims to address two problems: discrepancies in reported benchmark scores, and the gap between benchmark performance and real-world model capabilities.

Key takeaway

For NLP engineers and AI scientists evaluating models, Hugging Face's Community Evals provides a transparent, Hub-based way to report and compare scores. Publish your model's evaluation results directly in its repository, and consider contributing community evaluations for other models. Because the system exposes how, when, and by whom each evaluation was conducted, it fosters greater trust and reproducibility in benchmarks and helps you make better-informed decisions about model selection and deployment.

Key insights

Decentralized, transparent evaluation reporting on Hugging Face Hub aims to bridge the gap between benchmarks and real-world AI performance.

Principles

Evaluation results should be reproducible.
Community input enhances evaluation transparency.

Method

Dataset repositories register as benchmarks with an `eval.yaml` spec. Model repositories store scores in `.eval_results/*.yaml`. Community members submit results via pull requests.
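As a rough illustration of the model-repository side, the sketch below writes one score into a `.eval_results/` directory inside a local clone of a model repo. Only the directory name and the `.yaml` extension come from the article. The field names (`dataset`, `id`, `task_id`, `value`, `date`), the benchmark ID, and the helper `write_eval_result` are assumptions made for illustration, so check the Hub documentation for the actual schema before publishing.

```python
from pathlib import Path
import tempfile


def write_eval_result(repo_dir, benchmark_id, task_id, score, eval_date):
    """Write one score to .eval_results/<task_id>.yaml in a local repo clone."""
    results_dir = Path(repo_dir) / ".eval_results"
    results_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical field names: the real schema may differ.
    lines = [
        "- dataset:",
        f"    id: {benchmark_id}",
        f"    task_id: {task_id}",
        f"  value: {score}",
        f'  date: "{eval_date}"',
    ]

    out_path = results_dir / f"{task_id}.yaml"
    out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_path


if __name__ == "__main__":
    # Use a temporary directory as a stand-in for a cloned model repository.
    with tempfile.TemporaryDirectory() as repo:
        path = write_eval_result(
            repo, "example-org/example-benchmark", "default", 42.5, "2026-02-04"
        )
        print(path.read_text(encoding="utf-8"))
```

The file is built from plain strings to keep the sketch dependency-free; a YAML library would be the safer choice for values that need quoting or escaping.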

In practice

Publish your model's evaluation results in `.eval_results/`.
Submit community evaluation results via PRs.
Register new benchmark datasets with `eval.yaml`.
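To make the author-versus-community distinction concrete, here is a toy sketch of how a leaderboard might combine scores collected from several model repositories. It is not the Hub's implementation. The `EvalResult` type, the `build_leaderboard` function, the model names, and the choice to rank by descending score are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    model_id: str
    score: float
    submitted_by_author: bool  # False means the score arrived via a community PR


def build_leaderboard(results):
    """Rank results by score (highest first) and label each row's origin."""
    ranked = sorted(results, key=lambda r: r.score, reverse=True)
    return [
        (r.model_id, r.score, "author" if r.submitted_by_author else "community")
        for r in ranked
    ]


if __name__ == "__main__":
    # Hypothetical models and scores, for illustration only.
    rows = build_leaderboard(
        [
            EvalResult("example-org/model-a", 61.2, True),
            EvalResult("example-org/model-b", 64.8, False),
        ]
    )
    for model_id, score, origin in rows:
        print(f"{model_id}\t{score}\t{origin}")
```

Keeping the origin label on every row mirrors the article's point that community-submitted scores are shown as such rather than merged silently with author-reported ones.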

Topics

Community Evals
AI Model Benchmarking
Decentralized Evaluation
Evaluation Reproducibility
Hugging Face Hub

Best for: NLP Engineer, AI Scientist, AI Engineer, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.