CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks
Summary
CoEval is an open framework for ranking language models on custom tasks without labeled data and without relying on potentially contaminated public benchmarks. It addresses two limitations of generic benchmarks: they often fail to reflect specific sub-domains, and their items may have leaked into model pretraining data. CoEval uses an ensemble self-evaluation approach in which a pool of models rotates through three roles: teacher (generating a fresh, contamination-free benchmark), student (answering the questions), and judge (scoring the responses). No human labels or raters are needed. The framework's reliability stems from its label-free weighting system, which uses student responses to assess each question's discriminative power and the degree of judge consensus, effectively neutralizing broken judges and saturated questions. CoEval recovers true rankings with ρ=0.86 against ground truth and achieves a Spearman correlation of 0.95 across thirteen models. Its generated items show zero verbatim overlap with five public benchmarks, and its domain-specific rankings indicate that generic leaderboards often misdirect practitioners.
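The agreement figures above are rank correlations. As a minimal sketch of how such a number can be computed, the Python below implements Spearman's ρ as the Pearson correlation of average ranks. The function names are illustrative and are not taken from CoEval itself.

```python
def average_ranks(values):
    """Return 1-based ranks for `values`; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x ** 0.5 * var_y ** 0.5)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

To use it, pass two score lists in the same model order, for example the scores from a label-free ranking and the scores from a labeled reference set. A value near 1 means the two orderings largely agree.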
Key takeaway
For machine learning engineers selecting or fine-tuning language models for custom applications, CoEval addresses both benchmark contamination and the poor fit of generic evaluations. It lets you generate domain-specific leaderboards without human labels or public benchmarks, so you can assess model performance on your own sub-tasks and check that the chosen model is effective for your specific use case. Rerunning the pipeline on every model release keeps the evaluation contamination-free and up to date.
Key insights
CoEval enables trustworthy, task-specific language model ranking through ensemble self-evaluation, eliminating the need for external labels or benchmarks.
Principles
- Ensemble self-evaluation yields contamination-free, domain-specific benchmarks.
- Reliability in self-evaluation stems from panel composition, not size.
- Label-free weighting identifies and mitigates unreliable judges or questions.
Method
A pool of models rotates through teacher (generates benchmark), student (answers), and judge (scores) roles, using student responses to weight questions and judges.
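The summary does not give CoEval's weighting formulas, so the Python below is only a sketch of the general idea under stated assumptions. Judge weights come from agreement with the per-item median across judges, and question weights come from the variance of scores across students. Both choices, the `rank_students` name, and the `scores[judge][student][question]` layout with scores in [0, 1] are hypothetical stand-ins, not the paper's method.

```python
from statistics import mean, median, pvariance


def rank_students(scores):
    """Rank students from judge scores without any labels.

    `scores[judge][student]` is a list of per-question scores in [0, 1].
    Returns (ranking, judge_weights, question_weights).
    """
    judges = sorted(scores)
    students = sorted(scores[judges[0]])
    n_questions = len(scores[judges[0]][students[0]])

    # Assumed consensus: the median across judges for each (student, question).
    consensus = {
        s: [median([scores[j][s][q] for j in judges]) for q in range(n_questions)]
        for s in students
    }

    # Assumed judge weight: 1 minus mean absolute deviation from the consensus.
    raw = {
        j: 1.0 - mean([abs(scores[j][s][q] - consensus[s][q])
                       for s in students for q in range(n_questions)])
        for j in judges
    }
    total = sum(raw.values())
    judge_w = {j: (raw[j] / total if total > 0 else 1.0 / len(judges))
               for j in judges}

    # Judge-weighted score for each (student, question).
    combined = {
        s: [sum(judge_w[j] * scores[j][s][q] for j in judges)
            for q in range(n_questions)]
        for s in students
    }

    # Assumed question weight: variance across students, so a saturated
    # question (every student scores the same) gets zero weight.
    question_w = [pvariance([combined[s][q] for s in students])
                  for q in range(n_questions)]
    w_total = sum(question_w)
    if w_total == 0:  # no question discriminates; fall back to uniform weights
        question_w = [1.0] * n_questions
        w_total = float(n_questions)

    final = {s: sum(w * x for w, x in zip(question_w, combined[s])) / w_total
             for s in students}
    ranking = sorted(students, key=lambda s: final[s], reverse=True)
    return ranking, judge_w, question_w
```

In this sketch, a judge that returns the same score for every response drifts away from the consensus and is down-weighted, and a question every student answers equally well contributes nothing to the final ordering.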
In practice
- Create custom, contamination-free benchmarks for specific LLM applications.
- Evaluate fine-tuned models for unique sub-domains without human labels.
- Automate re-evaluation of models with each new release or update.
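When automating a pipeline like this, one possible sanity check on a freshly generated benchmark is to look for word n-grams it shares verbatim with public benchmark text. The Python below is a sketch of that check. The summary does not say how CoEval measures verbatim overlap, so the n-gram length, the whitespace tokenization, and the function names are all assumptions.

```python
def word_ngrams(text, n=8):
    """Set of lowercase word n-grams, using naive whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlapping_items(generated, reference_corpus, n=8):
    """Return the generated items that share at least one n-gram with any
    reference document. Items shorter than n words are never flagged."""
    reference = set()
    for document in reference_corpus:
        reference |= word_ngrams(document, n)
    return [item for item in generated if word_ngrams(item, n) & reference]
```

An empty result suggests, under these assumptions, that no generated item copies a run of `n` words from the reference text. It does not rule out paraphrased overlap.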
Topics
- CoEval
- Language Model Evaluation
- LLM Benchmarking
- Ensemble Self-Evaluation
- Contamination-Free Benchmarks
- Custom Task Performance
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.