Quantifying and Auditing LLM Evaluation via Positive-Unlabeled Learning

· Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

The PUAudit framework addresses systematic biases in "LLM-as-a-Judge" evaluation systems, whose preferences are often decoupled from semantic quality (verbosity bias is one example). It formulates LLM evaluation under selective human supervision as a positive-unlabeled (PU) learning problem. PUAudit proposes a geometric auditing method based on Partial Optimal Transport (POT) that operates in a fixed representation space and requires no retraining of the LLM judge. The method aligns a small set of human-verified positive judgments with a reliable subset of the unlabeled outputs, which identifies human-consistent preferences and corrects biased LLM judgments. Experiments on Chatbot Arena and MT-Bench data, using models such as Mistral-7B-Instruct and Qwen2.5-7B-Instruct, show improved alignment with human preferences and greater robustness to presentation biases, including length, sentiment, and distraction attacks. The improvements are systematic across six question types (QTA-QTF) and particularly benefit open-ended reasoning tasks.

Key takeaway

For Machine Learning Engineers deploying LLMs as judges, PUAudit offers a statistically grounded method to mitigate systematic biases like verbosity or sentiment. By applying this training-free geometric auditing framework, you can improve alignment with human preferences and enhance robustness against presentation attacks without costly retraining. Consider integrating PUAudit to refine your "LLM-as-a-Judge" evaluation pipelines, especially for open-ended tasks where judges are most fragile, ensuring more reliable and human-consistent quality assessments.

Key insights

LLM evaluation bias under selective human supervision can be audited geometrically using Positive-Unlabeled learning and Optimal Transport.
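
For orientation, the standard partial optimal transport problem can be written as below. This is the textbook formulation, not necessarily the exact objective used in the paper, and the symbols are assumptions introduced here: C is a cost matrix between human-verified positives (rows) and unlabeled judgments (columns), a and b are the marginal weights of the two sets, and m is the amount of mass to transport.

```latex
\min_{T \ge 0} \ \langle T, C \rangle
\quad \text{s.t.} \quad
T \mathbf{1} \le a, \qquad
T^{\top} \mathbf{1} \le b, \qquad
\mathbf{1}^{\top} T \mathbf{1} = m,
\qquad 0 \le m \le \min\left(\lVert a \rVert_1, \lVert b \rVert_1\right)
```

Because only a fraction m of the mass has to move, unlabeled points that are far from every positive can be left unmatched, which is what makes the partial variant a natural fit for the PU setting.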

Method

PUAudit constructs normalized difference embeddings from LLM preferences. It denoises the human-verified positives, then uses Partial Optimal Transport to align them with the unlabeled data. LLM judgments that receive low alignment scores are flipped.
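
The steps above can be sketched on synthetic data as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the centroid-based denoising rule, the cosine cost, the uniform marginals, and the `mass` and `threshold` parameters are all hypothetical choices made here. Partial OT is solved exactly as a small linear program with SciPy's `linprog`.

```python
import numpy as np
from scipy.optimize import linprog


def difference_embeddings(emb_a, emb_b, prefers_a):
    """Unit-norm (preferred - rejected) embedding for each judged pair."""
    sign = np.where(prefers_a, 1.0, -1.0)[:, None]
    diff = sign * (emb_a - emb_b)
    return diff / np.linalg.norm(diff, axis=1, keepdims=True)


def denoise_positives(pos, keep=0.8):
    """Assumed stand-in: keep the positives closest to their mean direction."""
    scores = pos @ pos.mean(axis=0)
    n_keep = max(1, int(round(keep * len(pos))))
    return pos[np.argsort(-scores)[:n_keep]]


def partial_ot_plan(cost, mass):
    """Exact partial OT with uniform marginals, solved as a linear program."""
    n, k = cost.shape
    a = np.full(n, 1.0 / n)
    b = np.full(k, 1.0 / k)
    row_sum = np.kron(np.eye(n), np.ones((1, k)))  # sum_j T[i, j] <= a[i]
    col_sum = np.kron(np.ones((1, n)), np.eye(k))  # sum_i T[i, j] <= b[j]
    res = linprog(
        cost.ravel(),
        A_ub=np.vstack([row_sum, col_sum]),
        b_ub=np.concatenate([a, b]),
        A_eq=np.ones((1, n * k)),  # total transported mass equals `mass`
        b_eq=[mass],
        bounds=(0, None),
        method="highs",
    )
    return res.x.reshape(n, k)


def audit(pos, unl, mass=0.5, threshold=0.5):
    """Score each unlabeled judgment by the share of its mass that was matched."""
    cost = 1.0 - pos @ unl.T  # cosine distance, since rows are unit vectors
    plan = partial_ot_plan(cost, mass)
    score = plan.sum(axis=0) * unl.shape[0]  # in [0, 1] per unlabeled point
    return score, score < threshold  # low score -> flip the LLM judgment


# Synthetic demo: a hypothetical "quality" direction separates good from bad.
rng = np.random.default_rng(0)
dim, n_pos, n_unl = 8, 10, 20
quality = np.eye(dim)[0]


def make_pairs(n):
    base = rng.normal(size=(n, dim))
    good = base + quality + 0.1 * rng.normal(size=(n, dim))
    bad = base - quality + 0.1 * rng.normal(size=(n, dim))
    return good, bad


# Human-verified positives: the preferred response is always the good one.
good, bad = make_pairs(n_pos)
pos = denoise_positives(difference_embeddings(good, bad, np.ones(n_pos, dtype=bool)))

# Unlabeled LLM judgments: the simulated judge is wrong on the second half.
good, bad = make_pairs(n_unl)
judge_prefers_good = np.arange(n_unl) < n_unl // 2
unl = difference_embeddings(good, bad, judge_prefers_good)

score, flip = audit(pos, unl, mass=0.5, threshold=0.5)
print("flipped indices:", np.flatnonzero(flip))
```

In this toy setup `mass` plays the role of the fraction of unlabeled judgments assumed to be human-consistent. In practice that fraction is unknown and would have to be estimated or tuned, and the embeddings would come from a fixed encoder instead of a random generator.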

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.