CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation
Summary
The CARE (Confounder-Aware Aggregation for Reliable LLM Evaluation) framework addresses a fundamental flaw in LLM-as-a-judge ensembles: aggregation methods often assume that judges err independently, even though their errors are correlated through shared latent confounders such as verbosity or stylistic preferences. CARE explicitly models judge scores as arising from both a true-quality signal and shared confounding factors, and separates these influences without requiring ground-truth labels. The framework offers theoretical guarantees for identifiability and finite-sample recovery, and it quantifies the systematic bias incurred when confounding factors are omitted. Across 12 public benchmarks covering continuous scoring, binary classification, and pairwise preference settings, CARE improves aggregation accuracy, reducing error by up to 26.8%. The framework includes two estimators: CARE-SVD for continuous scores under a joint-Gaussian assumption, and CARE-Tensor for discrete or preference-based regimes. They build on sparse-plus-low-rank structure and tensor decomposition.
Key takeaway
Research scientists evaluating LLM outputs should consider confounder-aware aggregation frameworks like CARE in place of traditional ensemble methods that assume independent judges. By mitigating correlated biases, this approach makes LLM-as-a-judge evaluations more reliable and accurate. It also provides diagnostic insight into the latent factors influencing judge behavior, which supports more robust and interpretable assessment systems.
Key insights
CARE improves LLM-as-a-judge evaluation by explicitly modeling and separating true quality from shared latent confounders.
Principles
- LLM judges exhibit correlated errors due to shared latent confounders.
- Explicitly modeling confounders improves aggregation accuracy and robustness.
- Identifiability and recovery are possible without ground-truth labels.
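The first two principles can be illustrated with a small simulation. This is a minimal sketch, not code from the paper: it assumes a hypothetical linear model in which every judge's score is the true quality plus a shared confounder term (standing in for something like verbosity) plus independent noise. All function names and parameter values are made up for illustration.

```python
import numpy as np

def simulate_judges(n_items=5000, n_judges=8, confounder_weight=0.8,
                    noise_std=0.5, seed=0):
    """Hypothetical model: score = quality + weight * confounder + noise."""
    rng = np.random.default_rng(seed)
    quality = rng.standard_normal(n_items)
    confounder = rng.standard_normal(n_items)  # shared by all judges
    noise = noise_std * rng.standard_normal((n_items, n_judges))
    scores = (quality[:, None]
              + confounder_weight * confounder[:, None]
              + noise)
    return quality, scores

def mean_error_correlation(quality, scores):
    """Average pairwise correlation between the judges' errors."""
    errors = scores - quality[:, None]
    corr = np.corrcoef(errors, rowvar=False)
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(off_diagonal.mean())

def averaging_mse(quality, scores):
    """Mean squared error of the plain average over judges."""
    return float(np.mean((scores.mean(axis=1) - quality) ** 2))

if __name__ == "__main__":
    for weight in (0.0, 0.8):
        q, s = simulate_judges(confounder_weight=weight)
        print(f"confounder_weight={weight}: "
              f"error corr={mean_error_correlation(q, s):.3f}, "
              f"averaging MSE={averaging_mse(q, s):.3f}")
```

Under these assumed settings, the population error correlation with the confounder switched on is 0.64 / (0.64 + 0.25) ≈ 0.72, and the error of plain averaging cannot fall below the confounder variance of 0.64 no matter how many judges are added. Independent noise averages out; a shared confounder does not.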
Method
CARE uses sparse-plus-low-rank decomposition to separate latent quality from confounders, employing CARE-SVD for Gaussian data and CARE-Tensor for discrete/mixture settings via tensor decomposition and graph-aware partitioning.
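To make the low-rank half of this idea concrete, the sketch below uses a plain truncated SVD of a simulated score matrix. It is a deliberate simplification and not the CARE-SVD estimator: it assumes one quality factor and one confounder with hypothetical per-judge loadings, ignores the sparse component, and does not resolve which direction in the recovered subspace corresponds to quality (the step the paper's identifiability results concern). The ground-truth quality is used only to check the simulation.

```python
import numpy as np

def simulate_scores(n_items=4000, seed=1):
    """Hypothetical scores: quality and one confounder, six judges."""
    rng = np.random.default_rng(seed)
    quality = rng.standard_normal(n_items)
    confounder = rng.standard_normal(n_items)
    # Assumed loadings: all judges track quality, three also track the confounder.
    quality_loadings = np.array([1.0, 0.9, 1.1, 0.8, 1.0, 0.9])
    confounder_loadings = np.array([0.9, 0.8, 0.7, -0.1, 0.0, 0.1])
    noise = 0.3 * rng.standard_normal((n_items, 6))
    scores = (np.outer(quality, quality_loadings)
              + np.outer(confounder, confounder_loadings)
              + noise)
    return quality, scores

def low_rank_factors(scores, rank):
    """Truncated SVD of the column-centered score matrix."""
    centered = scores - scores.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :rank] * s[:rank], s

def r_squared(target, features):
    """Share of target variance explained by a linear fit on features."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    return float(1.0 - residual.var() / target.var())

if __name__ == "__main__":
    quality, scores = simulate_scores()
    factors, singular_values = low_rank_factors(scores, rank=2)
    print("singular values:", np.round(singular_values, 1))
    print("R^2, one factor: ", round(r_squared(quality, factors[:, :1]), 3))
    print("R^2, two factors:", round(r_squared(quality, factors), 3))
```

In this toy setup two singular values stand clear of the noise floor, matching the two shared latent factors. Quality is explained noticeably better by the two-factor subspace than by the leading factor alone, because the leading direction blends quality with the confounder.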
In practice
- Use CARE to reduce aggregation error in LLM-as-a-judge systems (by up to 26.8% across the 12 public benchmarks reported).
- Interpret latent factors to diagnose non-quality attributes like verbosity or formatting.
- Integrate programmatic judges effectively by modeling their inherent biases.
Topics
- LLM Evaluation
- Confounder-Aware Aggregation
- Weak Supervision
- Latent Variable Models
- Tensor Decomposition
Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.