Quantifying and Auditing LLM Evaluation via Positive-Unlabeled Learning
Summary
The PUAudit framework addresses systematic biases in "LLM-as-a-Judge" evaluation systems, whose preferences are often decoupled from semantic quality (verbosity bias is one example). It formulates LLM evaluation under selective human supervision as a positive-unlabeled (PU) learning problem and proposes a geometric auditing method based on Partial Optimal Transport (POT) that operates in a fixed representation space without retraining the LLM judge. The method aligns a small set of human-verified positive judgments with a reliable subset of unlabeled outputs, which identifies human-consistent preferences and corrects biased LLM judgments. Experiments on Chatbot Arena and MT-Bench data, using models such as Mistral-7B-Instruct and Qwen2.5-7B-Instruct, show improved alignment with human preferences and greater robustness to presentation biases, including length, sentiment, and distraction attacks. The improvements are systematic across six question types (QTA-QTF) and particularly benefit open-ended reasoning tasks.
Key takeaway
For Machine Learning Engineers deploying LLMs as judges, PUAudit offers a statistically grounded method to mitigate systematic biases like verbosity or sentiment. By applying this training-free geometric auditing framework, you can improve alignment with human preferences and enhance robustness against presentation attacks without costly retraining. Consider integrating PUAudit to refine your "LLM-as-a-Judge" evaluation pipelines, especially for open-ended tasks where judges are most fragile, ensuring more reliable and human-consistent quality assessments.
Key insights
Bias in LLM evaluation under selective human supervision can be audited geometrically by combining positive-unlabeled learning with Partial Optimal Transport.
Principles
- LLM judges show systematic biases, favoring superficial features over semantic quality.
- Selective human supervision yields positive-unlabeled data, requiring specialized auditing.
- Geometric alignment via Partial Optimal Transport corrects judge bias without retraining.
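For readers who want the two ingredients in symbols, the block below gives their standard textbook forms. The notation is chosen here for illustration and is not necessarily the paper's: z is a preference representation, p_P and p_N are the distributions of human-consistent and human-inconsistent judgments, pi is the unknown share of human-consistent judgments among the unlabeled ones, C is a pairwise cost matrix, a and b are the point weights of the two sets, and s is the amount of mass to transport.

```latex
% PU setting: the unlabeled judgments are a mixture of both classes
p_U(z) = \pi \, p_P(z) + (1 - \pi) \, p_N(z), \qquad \pi \in (0, 1)

% Partial optimal transport: move only a mass s, leaving the rest unmatched
\min_{T \ge 0} \ \langle T, C \rangle
\quad \text{s.t.} \quad
T \mathbf{1} \le a, \qquad
T^{\top} \mathbf{1} \le b, \qquad
\mathbf{1}^{\top} T \mathbf{1} = s \le \min\left( \lVert a \rVert_1, \lVert b \rVert_1 \right)
```

Because only part of the mass has to move, unlabeled points that sit far from the positives can stay unmatched, which is what makes the partial formulation a natural fit for a mixture that contains negatives.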
Method
PUAudit builds a normalized difference embedding for each LLM preference judgment. It denoises the human-verified positives, uses Partial Optimal Transport to align them with the unlabeled judgments, and flips the LLM judgments that receive low alignment scores.
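The steps in this method can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the cosine cost, the `mass` and `flip_below` parameters, and the definition of the alignment score as the share of a point's mass that gets matched are all hypothetical choices. The partial OT problem is solved exactly as a small linear program with SciPy, which is convenient for a demo but would not scale to large sets.

```python
import numpy as np
from scipy.optimize import linprog


def difference_embeddings(preferred, rejected):
    """Unit-normalised (preferred - rejected) vectors, one row per judgment."""
    diff = np.asarray(preferred, dtype=float) - np.asarray(rejected, dtype=float)
    norms = np.linalg.norm(diff, axis=1, keepdims=True)
    return diff / np.clip(norms, 1e-12, None)


def partial_ot_plan(cost, mass):
    """Exact partial OT between uniform weights, written as a linear program.

    minimise <T, cost> subject to T >= 0, row sums <= 1/n,
    column sums <= 1/m, and total transported mass == mass.
    """
    n, m = cost.shape
    row_sum = np.kron(np.eye(n), np.ones((1, m)))  # sums each row of T
    col_sum = np.kron(np.ones((1, n)), np.eye(m))  # sums each column of T
    result = linprog(
        c=cost.ravel(),
        A_ub=np.vstack([row_sum, col_sum]),
        b_ub=np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)]),
        A_eq=np.ones((1, n * m)),
        b_eq=[mass],
        bounds=(0, None),
        method="highs",
    )
    if not result.success:
        raise RuntimeError(result.message)
    return result.x.reshape(n, m)


def audit(positives, unlabeled, mass=0.5, flip_below=0.5):
    """Return an alignment score per unlabeled judgment and a flip mask.

    Assumption: the score is the fraction of a point's own weight that the
    plan matches to the positives (1.0 = fully matched, 0.0 = unmatched).
    """
    cost = 1.0 - positives @ unlabeled.T  # cosine distance on unit vectors
    plan = partial_ot_plan(cost, mass)
    score = plan.sum(axis=0) * unlabeled.shape[0]
    return score, score < flip_below


# Toy data: two human-verified positives and four unlabeled judgments,
# the last two of which point the "wrong" way in embedding space.
positives = difference_embeddings([[1.0, 0.2], [0.9, -0.1]],
                                  [[0.0, 0.1], [0.1, 0.0]])
unlabeled = difference_embeddings(
    [[1.0, 0.0], [0.8, 0.3], [0.0, 0.0], [0.1, 0.2]],
    [[0.0, 0.0], [0.0, 0.2], [1.0, 0.1], [0.9, 0.1]],
)
score, flip = audit(positives, unlabeled, mass=0.5)
print(score.round(3), flip)
```

In this sketch `mass` plays the role of a guess at how much of the unlabeled set is human-consistent. Setting it too high forces matches onto points that are far from the positives, so in practice it would need to be estimated or tuned.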
In practice
- Use reward-model encoders for robust preference representations.
- Filter human-verified positives to remove geometric outliers.
- Apply POT to identify latent human-consistent preferences.
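The second tip above, filtering geometric outliers from the positives, could look roughly like the sketch below. The summary does not say which denoising rule the paper uses, so this swaps in a simple stand-in: keep the positives whose cosine similarity to the mean direction is highest. The function name and the `keep_fraction` parameter are hypothetical.

```python
import numpy as np


def filter_positive_outliers(positives, keep_fraction=0.8):
    """Keep the positives most aligned with the mean direction.

    Assumption: a centroid-based cosine filter, used here only as a
    stand-in for whatever denoising step the paper applies.
    Returns the kept rows and their original indices.
    """
    positives = np.asarray(positives, dtype=float)
    mean_dir = positives.mean(axis=0)
    mean_dir = mean_dir / max(np.linalg.norm(mean_dir), 1e-12)
    similarity = positives @ mean_dir
    n_keep = max(1, int(round(keep_fraction * len(positives))))
    keep = np.sort(np.argsort(-similarity)[:n_keep])
    return positives[keep], keep


# Four consistent positives and one that points the opposite way.
toy_positives = [[1.0, 0.0], [0.99, 0.14], [0.99, -0.14], [0.98, 0.2], [-1.0, 0.0]]
kept, kept_idx = filter_positive_outliers(toy_positives, keep_fraction=0.8)
print(kept_idx)
```

A mean-direction filter is only reasonable when most positives agree with each other. If a large share of the verified set were noisy, the mean itself would be pulled off course and a more robust centre would be needed.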
Topics
- LLM Evaluation
- LLM-as-a-Judge Bias
- Positive-Unlabeled Learning
- Partial Optimal Transport
- Geometric Auditing
- Reward Models
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.