CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation

2025-05-15 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

The CARE (Confounder-Aware Aggregation for Reliable LLM Evaluation) framework addresses a fundamental flaw in LLM-as-a-judge ensembles: they often treat judge estimates as independent even though judge errors are correlated through shared latent confounders such as verbosity or stylistic preferences. CARE explicitly models judge scores as arising from both a true-quality signal and shared confounding factors, and separates these influences without requiring ground-truth labels. The framework offers theoretical guarantees for identifiability and finite-sample recovery, and quantifies the systematic bias incurred when confounding factors are omitted. Across 12 public benchmarks covering continuous scoring, binary classification, and pairwise preference settings, CARE improves aggregation accuracy, reducing error by up to 26.8%. The framework includes two estimators: CARE-SVD, for continuous scores under a joint-Gaussian assumption, and CARE-Tensor, for discrete or preference-based regimes. The estimators build on sparse-plus-low-rank structure and tensor decomposition.

Key takeaway

Research scientists evaluating LLM outputs should consider adopting confounder-aware aggregation frameworks such as CARE in place of ensemble methods that assume independent judges. Modeling correlated biases makes LLM-as-a-judge evaluations more reliable and accurate, and the recovered latent factors offer diagnostic insight into what drives judge behavior, which supports more robust and interpretable assessment systems.

Key insights

CARE improves LLM-as-a-judge evaluation by explicitly modeling and separating true quality from shared latent confounders.

Principles

LLM judges exhibit correlated errors due to shared latent confounders.
Explicitly modeling confounders improves aggregation accuracy and robustness.
Identifiability and recovery are possible without ground-truth labels.
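
A small simulation can illustrate the first principle. The sketch below is not from the paper: it assumes a hypothetical linear model in which every judge observes true quality, plus a shared confounder with weight 0.8, plus independent noise, and it compares plain averaging with and without that confounder. All names and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 20000

quality = rng.normal(size=n_items)     # latent true quality (never observed in practice)
confounder = rng.normal(size=n_items)  # shared non-quality factor, e.g. verbosity


def judge_scores(confounder_weight, n_judges):
    """Simulate judges that all see quality + a shared confounder + own noise."""
    noise = rng.normal(scale=0.5, size=(n_items, n_judges))
    shared = quality + confounder_weight * confounder
    return shared[:, None] + noise


def corr_with_quality(scores):
    """Correlation between the naive per-item mean score and true quality."""
    return float(np.corrcoef(scores.mean(axis=1), quality)[0, 1])


# Independent errors: averaging washes the noise out.
corr_independent = corr_with_quality(judge_scores(0.0, n_judges=8))

# Shared confounder: the common error survives averaging,
# and adding more judges does not remove it.
corr_confounded_8 = corr_with_quality(judge_scores(0.8, n_judges=8))
corr_confounded_64 = corr_with_quality(judge_scores(0.8, n_judges=64))

print(corr_independent, corr_confounded_8, corr_confounded_64)
```

Under these assumptions, moving from 8 to 64 judges barely changes the confounded correlation, because the shared term does not average out; in the limit of infinitely many judges it approaches 1/sqrt(1 + 0.8^2), roughly 0.78.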

Method

CARE uses sparse-plus-low-rank decomposition to separate latent quality from confounders, employing CARE-SVD for Gaussian data and CARE-Tensor for discrete/mixture settings via tensor decomposition and graph-aware partitioning.
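
To make the sparse-plus-low-rank idea concrete, the sketch below checks it at the population level for a linear-Gaussian factor model. It is an illustration under stated assumptions, not the CARE-SVD estimator itself: the loadings, noise variances, and the use of the true noise variances as the sparse part are all hypothetical, whereas a real estimator would have to infer the sparse part from data. By the Woodbury identity, the precision matrix of the judge scores equals a diagonal (sparse) matrix minus a term whose rank is the number of latent factors, and an SVD of that term recovers the latent subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 8, 2  # 8 judges; 2 latent factors (quality + one confounder), both assumed

# Hypothetical loadings: column 0 = quality, column 1 = a verbosity-like confounder.
loadings = np.column_stack([
    rng.uniform(0.8, 1.2, size=m),  # every judge responds to quality
    rng.uniform(0.0, 1.0, size=m),  # judges respond to the confounder unevenly
])
noise_var = rng.uniform(0.2, 0.5, size=m)  # judge-specific independent noise

# Marginal covariance and precision of the judge scores under the factor model.
cov = loadings @ loadings.T + np.diag(noise_var)
precision = np.linalg.inv(cov)

# Woodbury identity: precision = diag(1 / noise_var) - low_rank_part.
sparse_part = np.diag(1.0 / noise_var)  # oracle here; estimated in practice
low_rank_part = sparse_part - precision

# The low-rank part has exactly k non-negligible singular values.
u, singular_values, _ = np.linalg.svd(low_rank_part)
numerical_rank = int(np.sum(singular_values > 1e-8 * singular_values[0]))

# Its top-k left singular vectors span the same space as diag(1/noise_var) @ loadings.
basis = u[:, :k]
target = sparse_part @ loadings
residual = target - basis @ (basis.T @ target)
subspace_error = float(np.linalg.norm(residual) / np.linalg.norm(target))

print(numerical_rank, subspace_error)
```

The SVD identifies the latent subspace only up to rotation. Deciding which direction corresponds to true quality rather than a confounder needs additional identifiability assumptions, which this summary does not spell out.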

In practice

Use CARE to aggregate judge scores in LLM-as-a-judge systems; across 12 public benchmarks it reduced aggregation error by up to 26.8%.
Interpret latent factors to diagnose non-quality attributes like verbosity or formatting.
Integrate programmatic judges effectively by modeling their inherent biases.

Topics

LLM Evaluation
Confounder-Aware Aggregation
Weak Supervision
Latent Variable Models
Tensor Decomposition

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.