Community Evals: Because we're done trusting black-box leaderboards over the community
Summary
On February 4, 2026, Hugging Face launched "Community Evals", a system that decentralizes AI model evaluation reporting and makes it more transparent. Benchmark datasets on the Hugging Face Hub can now host leaderboards that automatically aggregate evaluation scores from model repositories. Models store their evaluation results in `.eval_results/*.yaml` files, which are displayed on model cards and fed into benchmark leaderboards. Anyone in the community can submit evaluation results for any model via pull request; these appear as "community" scores and do not require the model author's approval. Benchmarks define their evaluation specifications in an `eval.yaml` file, based on the Inspect AI format, so that results can be reproduced. The system aims to address discrepancies in reported benchmark scores and the gap between benchmark performance and real-world model capabilities.
Key takeaway
For NLP engineers and AI scientists evaluating models, Hugging Face's Community Evals offers a shared, transparent place on the Hub to report and compare scores. Publish your model's evaluation results directly in its repository, and consider contributing community evaluations for other models. Because each result shows how, when, and by whom the evaluation was run, benchmarks become easier to trust and reproduce, which helps you make better-informed decisions about model selection and deployment.
Key insights
Decentralized, transparent evaluation reporting on Hugging Face Hub aims to bridge the gap between benchmarks and real-world AI performance.
Principles
- Evaluation results should be reproducible.
- Community input enhances evaluation transparency.
Method
Dataset repositories register as benchmarks with an `eval.yaml` spec. Model repositories store scores in `.eval_results/*.yaml`. Community members submit results via pull requests.
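The aggregation step described above can be illustrated with a minimal sketch. It assumes each result has already been read from a model repository into a plain record; the `EvalResult` fields, the `build_leaderboards` helper, and the sample data are all hypothetical and do not reflect the Hub's actual schema or implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """One score as it might be read from a model repo's .eval_results/ file.

    All field names here are hypothetical, not the Hub's schema.
    """
    model_id: str       # repository that holds the result file
    benchmark_id: str   # dataset repository registered as a benchmark
    score: float
    source: str         # "author" or "community" (submitted via pull request)


def build_leaderboards(results):
    """Group results by benchmark and sort each group by score, highest first."""
    boards = defaultdict(list)
    for r in results:
        boards[r.benchmark_id].append(r)
    return {
        bench: sorted(rows, key=lambda r: r.score, reverse=True)
        for bench, rows in boards.items()
    }


# Made-up example data; no real models, benchmarks, or scores.
results = [
    EvalResult("org-a/model-x", "bench-org/bench-1", 71.5, "author"),
    EvalResult("org-b/model-y", "bench-org/bench-1", 74.0, "community"),
    EvalResult("org-a/model-x", "bench-org/bench-2", 40.2, "community"),
]

boards = build_leaderboards(results)
for bench, rows in boards.items():
    print(bench)
    for rank, r in enumerate(rows, start=1):
        print(f"  {rank}. {r.model_id}  {r.score:.1f}  ({r.source})")
```

Keeping the `source` label on every row mirrors the idea that community-submitted scores are shown alongside author-reported ones rather than being merged into them.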
In practice
- Publish your model's evaluation results in `.eval_results/`.
- Submit community evaluation results via PRs.
- Register new benchmark datasets with `eval.yaml`.
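As a rough illustration of the first step, the sketch below writes one result file into a local checkout's `.eval_results/` directory. The helper name `write_eval_result`, the file-naming rule, and the keys `benchmark`, `score`, and `date` are placeholders chosen for this example; the real file layout is defined by Hugging Face and should be taken from its documentation.

```python
import tempfile
from pathlib import Path


def write_eval_result(repo_dir, benchmark_id, score, date):
    """Write one result file under <repo_dir>/.eval_results/ and return its path.

    The keys below (benchmark, score, date) are placeholders, not the Hub's schema.
    """
    out_dir = Path(repo_dir) / ".eval_results"
    out_dir.mkdir(parents=True, exist_ok=True)
    # Derive a file name from the benchmark id, e.g. "org/bench" -> "org__bench.yaml".
    path = out_dir / (benchmark_id.replace("/", "__") + ".yaml")
    path.write_text(
        f'benchmark: "{benchmark_id}"\n'
        f"score: {score}\n"
        f'date: "{date}"\n',
        encoding="utf-8",
    )
    return path


if __name__ == "__main__":
    # Use a temporary directory as a stand-in for a local model repo checkout.
    with tempfile.TemporaryDirectory() as repo:
        p = write_eval_result(repo, "bench-org/bench-1", 71.5, "2026-02-04")
        print(p.read_text(encoding="utf-8"))
```

Once written, the file would be committed to your own model repository, or proposed as a pull request on someone else's, using whichever Git or Hub tooling you normally use.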
Topics
- Community Evals
- AI Model Benchmarking
- Decentralized Evaluation
- Evaluation Reproducibility
- Hugging Face Hub
Best for: NLP Engineer, AI Scientist, AI Engineer, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.