CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

Source: cs.AI updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

CoEval is an open framework designed to rank language models for custom tasks without requiring labeled data or relying on potentially contaminated public benchmarks. It addresses the limitations of generic benchmarks, which often fail to reflect specific sub-domains and may contain items leaked into model pretraining. CoEval employs an ensemble self-evaluation approach where a pool of models dynamically assumes roles as teacher (generating a fresh, contamination-free benchmark), student (answering questions), and judge (scoring responses). This process eliminates the need for human labels or raters. The framework's reliability stems from its label-free weighting system, which uses student responses to assess question discriminative power and judge consensus, effectively neutralizing broken judges and saturated questions. CoEval recovers true rankings with ρ=0.86 against ground truth and achieves Spearman 0.95 for thirteen models. Its generated items show zero verbatim overlap with five public benchmarks, and its domain-specific rankings demonstrate that generic leaderboards often misdirect practitioners.
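
The label-free weighting idea can be illustrated with a minimal Python sketch. This is not CoEval's actual formula, which the summary does not give. It assumes a judge's weight is its correlation with the average of the other judges (floored at zero), and a question's weight is the variance of consensus scores across students. The function names and the toy `scores` array are hypothetical.

```python
from statistics import mean, pvariance

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5

def judge_weights(scores):
    """Assumed rule: weight = agreement with the mean of the other judges."""
    flat = [[v for per_q in judge for v in per_q] for judge in scores]
    weights = []
    for j, own in enumerate(flat):
        others = [mean(f[i] for k, f in enumerate(flat) if k != j)
                  for i in range(len(own))]
        weights.append(max(0.0, pearson(own, others)))
    return weights

def consensus(scores, jw):
    """Judge-weighted mean score for every (question, student) pair."""
    total = sum(jw) or 1.0
    n_q, n_s = len(scores[0]), len(scores[0][0])
    return [[sum(w * judge[q][s] for w, judge in zip(jw, scores)) / total
             for s in range(n_s)] for q in range(n_q)]

def question_weights(cons):
    """Assumed rule: a question that separates students gets more weight."""
    return [pvariance(row) for row in cons]

def rank_students(scores):
    """Return student indices ordered from best to worst."""
    jw = judge_weights(scores)
    cons = consensus(scores, jw)
    qw = question_weights(cons)
    total = sum(qw) or 1.0
    totals = [sum(w * row[s] for w, row in zip(qw, cons)) / total
              for s in range(len(cons[0]))]
    return sorted(range(len(totals)), key=lambda s: -totals[s])

# Toy data, scores[judge][question][student]. Judge 2 is "broken" (constant
# output) and question 2 is "saturated" (every student gets full marks).
scores = [
    [[0.9, 0.5, 0.1], [0.8, 0.6, 0.2], [1.0, 1.0, 1.0]],
    [[1.0, 0.4, 0.2], [0.7, 0.5, 0.1], [1.0, 1.0, 1.0]],
    [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]],
]
print(judge_weights(scores))
print(rank_students(scores))
```

Under these assumptions the constant judge gets weight zero and the saturated question contributes nothing to the final ranking, which is the neutralizing effect described above.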

Key takeaway

For machine learning engineers selecting or fine-tuning language models for custom applications, CoEval addresses both benchmark contamination and the poor fit of generic evaluations. It produces domain-specific leaderboards without human labels or public benchmarks, so you can check how candidate models perform on your own sub-tasks before committing to one. Rerunning the pipeline on each model release keeps the evaluation contamination-free and current.

Key insights

CoEval enables trustworthy, task-specific language model ranking through ensemble self-evaluation, eliminating the need for external labels or benchmarks.

Method

A pool of models rotates through teacher (generates benchmark), student (answers), and judge (scores) roles, using student responses to weight questions and judges.
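
The role rotation can be sketched in Python as follows. The summary does not specify the scheduling scheme, so this assumes every ordered triple of distinct models is used, meaning no model answers its own question or judges its own answer. The `Model` class, the `stub` helper, the model names, and the topic string are all hypothetical stand-ins for real model calls.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import permutations
from typing import Callable

@dataclass
class Model:
    name: str
    ask: Callable[[str], str]           # teacher role: topic -> question
    answer: Callable[[str], str]        # student role: question -> answer
    score: Callable[[str, str], float]  # judge role: (question, answer) -> score

def role_assignments(names):
    """Assumed schedule: all (teacher, student, judge) triples of distinct models."""
    return list(permutations(names, 3))

def run_pool(pool, topic):
    """Run every role assignment once and collect the judge scores."""
    by_name = {m.name: m for m in pool}
    records = []
    for teacher, student, judge in role_assignments(list(by_name)):
        question = by_name[teacher].ask(topic)
        answer = by_name[student].answer(question)
        records.append((teacher, student, judge,
                        by_name[judge].score(question, answer)))
    return records

def stub(name):
    """Deterministic placeholder model; a real pool would call LLM APIs here."""
    return Model(
        name=name,
        ask=lambda topic: f"{name} asks about {topic}",
        answer=lambda question: f"{name} answers: {question}",
        score=lambda question, answer: (len(answer) % 5) / 4,
    )

pool = [stub(n) for n in ("model_a", "model_b", "model_c")]
records = run_pool(pool, "contract law")
teacher_counts = Counter(teacher for teacher, _, _, _ in records)
print(len(records), dict(teacher_counts))
```

The resulting records are the raw material for the weighting step: grouping them by student gives per-model scores, and grouping by judge or question gives the agreement and discrimination signals.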

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.