CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

2026-06-02 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

CoEval is an open-source framework designed to rank language models for custom applications without relying on task-specific labeled data or potentially contaminated public benchmarks. It addresses the challenge of data leakage by synthesizing fresh, attribute-controlled benchmarks using teacher models, ensuring contamination-free item generation on each run. The framework then employs a cross-family judge ensemble to rank candidate models, eliminating the need for human raters or calibration. Validation shows CoEval recovers true model rankings and tracks ground-truth correctness at ρ = 0.86. Key findings indicate that judge-panel composition, specifically vendor diversity, is crucial for reliability, with small, diverse panels outperforming single judges. A four-task study generated 7,978 evaluations for USD 5.89, demonstrating its cost-effectiveness and reusability for any domain.

Key takeaway

For AI engineers evaluating language models for proprietary applications, CoEval addresses both benchmark contamination and the scarcity of labeled data. You can generate custom leaderboards without human labels or pre-existing data, so your model choices reflect performance on your task rather than memorization of public benchmarks. Because a run is cheap (USD 5.89 for 7,978 evaluations in the four-task study), you can re-evaluate with every new model release before deciding what to deploy.

Key insights

The CoEval framework enables reliable, label-free LLM ranking for custom tasks by synthesizing fresh benchmarks and using diverse judge ensembles.

Principles

Judge panel diversity, not size, drives reliability.
Contamination-free benchmarks require dynamic generation.
Single judges can be anti-correlated with ground truth.
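
To make the third principle concrete, here is a minimal sketch of how agreement between a judge and ground truth could be measured with a rank correlation. It assumes Spearman's rank correlation as the statistic, which the summary does not confirm. All scores, and the names truth, judge_a, judge_b, judge_c, and panel, are hypothetical and chosen only to illustrate the effect; they are not CoEval results.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical data: ground-truth accuracy of five candidate models,
# and the mean score each judge assigned to the same five models.
truth = [0.9, 0.8, 0.7, 0.6, 0.5]
judge_a = [4, 5, 6, 7, 8]    # a judge whose preferences invert the truth
judge_b = [10, 8, 5, 4, 1]
judge_c = [9, 8, 6, 3, 2]

# Panel score: plain average of the three judges per model.
panel = [mean(s) for s in zip(judge_a, judge_b, judge_c)]

print("judge_a vs truth:", spearman(truth, judge_a))
print("panel   vs truth:", spearman(truth, panel))
```

In this toy data, judge_a alone is negatively correlated with ground truth, while the three-judge average is positively correlated, because the other two judges outweigh the inverted one. Whether a real panel behaves this way depends on how independent its judges' errors are, which is the case the summary makes for vendor diversity.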

Method

CoEval synthesizes attribute-controlled benchmarks via teacher models, then uses a cross-family judge ensemble to rank candidate LLMs, ensuring contamination-free, label-free evaluation without human calibration.
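
The ranking step can be sketched as follows. This is an illustration under stated assumptions, not CoEval's actual aggregation rule, which the summary does not describe: each judge's scores are standardized so that lenient and harsh judges carry equal weight, then averaged into a consensus score. The judge names, model names, scores, and the functions standardize and rank_candidates are all hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical scores: judge -> candidate -> scores on the same generated items.
scores = {
    "judge_vendor_a": {"model_x": [7, 8, 6], "model_y": [5, 6, 5], "model_z": [8, 9, 9]},
    "judge_vendor_b": {"model_x": [3, 4, 3], "model_y": [2, 3, 2], "model_z": [4, 4, 5]},
    "judge_vendor_c": {"model_x": [6, 6, 7], "model_y": [6, 5, 4], "model_z": [9, 8, 8]},
}

def standardize(per_candidate):
    """Z-score one judge's per-candidate means to remove its scale and offset."""
    means = {c: mean(s) for c, s in per_candidate.items()}
    values = list(means.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # This judge does not discriminate between candidates at all.
        return {c: 0.0 for c in means}
    return {c: (m - mu) / sigma for c, m in means.items()}

def rank_candidates(scores):
    """Average standardized scores across judges; return candidates, best first."""
    per_judge = [standardize(pc) for pc in scores.values()]
    candidates = list(per_judge[0])
    consensus = {c: mean(j[c] for j in per_judge) for c in candidates}
    return sorted(consensus, key=consensus.get, reverse=True)

leaderboard = rank_candidates(scores)
print(leaderboard)  # ['model_z', 'model_x', 'model_y']
```

Standardizing per judge is one design choice among several; averaging ranks or using pairwise preferences would be reasonable alternatives. In this example judge_vendor_b scores everything lower than the others, yet its ordering still contributes equally to the consensus.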

In practice

Regenerate application-specific leaderboards cheaply.
Use diverse judge panels for robust LLM evaluation.
Avoid single-judge evaluations for critical tasks.

Topics

Language Model Ranking
Benchmark Contamination
Label-Free Evaluation
LLM Benchmarking
Judge Ensemble
Custom Task Evaluation

Best for: AI Architect, NLP Engineer, AI Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.