CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

2026-06-02 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

CoEval is an open framework for ranking language models on custom tasks without labeled data and without relying on public benchmarks that may be contaminated. It targets two weaknesses of generic benchmarks: they often fail to reflect a specific sub-domain, and their items may have leaked into model pretraining data. CoEval uses ensemble self-evaluation, in which a pool of models rotates through three roles: teacher (generating a fresh, contamination-free benchmark), student (answering the questions), and judge (scoring the responses). No human labels or raters are required. Reliability comes from a label-free weighting scheme that uses student responses to estimate each question's discriminative power and each judge's agreement with the consensus, which neutralizes broken judges and saturated questions. CoEval is reported to recover true rankings with ρ=0.86 against ground truth and to reach a Spearman correlation of 0.95 across thirteen models. Its generated items show zero verbatim overlap with five public benchmarks, and its domain-specific rankings indicate that generic leaderboards often misdirect practitioners.
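The ρ and Spearman figures above are rank correlations: they measure how closely one ordering of models matches another. As a minimal sketch of how such a number is computed, the snippet below implements Spearman's rho in plain Python. The function names and the five model scores are illustrative assumptions, not values or code from CoEval.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    norm_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (norm_x * norm_y)


# Hypothetical scores for five models (illustrative, not from the paper).
reference_scores = [0.90, 0.70, 0.80, 0.40, 0.60]
pipeline_scores = [0.85, 0.75, 0.70, 0.50, 0.55]
rho = spearman(reference_scores, pipeline_scores)
print(f"Spearman rho = {rho:.2f}")
```

Here two adjacent models swap places between the two orderings, so the correlation is high but below 1.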

Key takeaway

For machine learning engineers selecting or fine-tuning language models for custom applications, CoEval addresses two problems: benchmark contamination and generic evaluations that miss the target sub-domain. It produces domain-specific leaderboards without human labels or public benchmarks, so you can check how candidate models perform on your own sub-tasks before committing to one. Rerunning the pipeline at each model release keeps the evaluation current and contamination-free.

Key insights

CoEval enables trustworthy, task-specific language model ranking through ensemble self-evaluation, eliminating the need for external labels or benchmarks.

Principles

Ensemble self-evaluation yields contamination-free, domain-specific benchmarks.
Reliability in self-evaluation stems from panel composition, not size.
Label-free weighting identifies and mitigates unreliable judges or questions.
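The label-free weighting principle can be illustrated with a small sketch. The summary does not give CoEval's actual formulas, so everything below is an assumption chosen for illustration: question weight is taken as the variance of student scores (a saturated question, where every student scores the same, gets weight zero), and judge weight shrinks with the judge's mean absolute deviation from the median of the other judges.

```python
from statistics import mean, median, pvariance


def question_weights(student_scores):
    """student_scores[q] holds one consensus score per student for question q.

    Assumed rule: weight is proportional to score variance across students,
    so a question every student answers equally well carries no weight.
    """
    spread = [pvariance(scores) for scores in student_scores]
    total = sum(spread)
    if total == 0:
        return [1.0 / len(spread)] * len(spread)
    return [s / total for s in spread]


def judge_weights(judge_scores):
    """judge_scores[j][i] is the score judge j gave to item i.

    Assumed rule: a judge is down-weighted by its mean absolute deviation
    from the median score of the other judges on the same items.
    """
    n_judges, n_items = len(judge_scores), len(judge_scores[0])
    raw = []
    for j in range(n_judges):
        consensus = [
            median(judge_scores[k][i] for k in range(n_judges) if k != j)
            for i in range(n_items)
        ]
        deviation = mean(abs(judge_scores[j][i] - consensus[i]) for i in range(n_items))
        raw.append(1.0 / (1.0 + deviation))
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical data: question 1 is saturated, judge 2 returns a constant score.
q_w = question_weights([[0.2, 0.9, 0.5], [1.0, 1.0, 1.0], [0.4, 0.6, 0.5]])
j_w = judge_weights([[0.9, 0.2, 0.6, 0.4], [0.8, 0.3, 0.6, 0.5], [0.1, 0.1, 0.1, 0.1]])
print(q_w, j_w)
```

In this toy setup the saturated question drops out entirely and the constant judge receives the smallest share. A real system would likely need a stricter rule to fully neutralize a broken judge.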

Method

A pool of models rotates through three roles: teacher (generates the benchmark), student (answers the questions), and judge (scores the answers). Student responses are then used to weight both questions and judges.
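A rough sketch of the role rotation follows. The `generate`, `answer`, and `score` callables are hypothetical stand-ins for real model calls, and the rule that one model never holds two roles in the same triple is an assumption made here to avoid self-grading; the summary does not say how CoEval assigns roles.

```python
from itertools import permutations


def run_rotation(pool, generate, answer, score):
    """Collect one score record per (teacher, student, judge, question)."""
    # Each model writes its own questions once, acting as teacher.
    questions = {teacher: generate(teacher) for teacher in pool}
    answers = {}  # cache so a student answers each question only once
    records = []
    # Assumption: the three roles in a triple are held by distinct models.
    for teacher, student, judge in permutations(pool, 3):
        for question in questions[teacher]:
            key = (student, question)
            if key not in answers:
                answers[key] = answer(student, question)
            records.append({
                "teacher": teacher,
                "student": student,
                "judge": judge,
                "question": question,
                "score": score(judge, question, answers[key]),
            })
    return records


# Stub callables standing in for real model calls (hypothetical).
pool = ["model_a", "model_b", "model_c"]
records = run_rotation(
    pool,
    generate=lambda m: [f"{m}-q1", f"{m}-q2"],
    answer=lambda m, q: f"{m} answers {q}",
    score=lambda j, q, a: 1.0,
)
print(len(records))
```

The resulting records are the raw material for any weighting step: they can be grouped by question to measure student spread, or by judge to measure agreement.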

In practice

Create custom, contamination-free benchmarks for specific LLM applications.
Evaluate fine-tuned models for unique sub-domains without human labels.
Automate re-evaluation of models with each new release or update.
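One way to check a generated benchmark for verbatim overlap with existing public items is a shared n-gram test. The sketch below is an assumption about how such a check could look, not CoEval's documented procedure: it uses lowercase whitespace tokens and flags any generated item that shares a run of eight consecutive words with a reference item. The sample strings are invented.

```python
def word_ngrams(text, n=8):
    """Return the set of n-word runs in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlapping_items(generated, reference, n=8):
    """Return generated items sharing at least one n-word run with reference."""
    reference_grams = set()
    for item in reference:
        reference_grams |= word_ngrams(item, n)
    return [g for g in generated if word_ngrams(g, n) & reference_grams]


# Invented sample items for illustration.
reference = ["What is the capital city of France and when was it founded"]
generated = [
    "Tell me: what is the capital city of France and when did it grow",
    "Explain gradient clipping in one short sentence for a beginner audience",
]
flagged = overlapping_items(generated, reference)
print(flagged)
```

Items shorter than `n` words produce no n-grams and are never flagged, so a shorter window may be needed for short questions.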

Topics

CoEval
Language Model Evaluation
LLM Benchmarking
Ensemble Self-Evaluation
Contamination-Free Benchmarks
Custom Task Performance

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.