RECOM: A Validity–Discrimination Tradeoff in Automatic Metrics for Open-Ended Reddit Question Answering
Summary
The paper introduces RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free dataset of 15,000 r/AskReddit questions from September 2025, each paired with authentic community replies. Evaluating five open-source LLMs (7–10B parameters) against these replies, the authors identify a "validity–discrimination tradeoff" in automatic metrics for open-ended, opinion-driven question answering. A metric either separates genuine content alignment from noise (validity; e.g., cosine similarity, with Cohen's d≈2) or reliably ranks systems (discriminative power; e.g., BERTScore precision, with raw |d| up to 0.63, which collapses to |d|=0.09 once length is controlled). No single metric does both jobs well. The tradeoff is a property of metric design, not of the models, and is corroborated by three independent LLM judges.
Key takeaway
AI scientists and ML engineers evaluating LLMs on open-ended tasks should adopt a multi-faceted metric reporting strategy. Do not rely on a single metric to both confirm content alignment and rank models, because these two goals trade off against each other. Report metrics on both axes, include an explicit random-baseline floor, and always control for response length when assessing discriminative power so that verbosity is not mistaken for quality.
Key insights
Automatic metrics for open-ended QA exhibit a validity–discrimination tradeoff: no single metric both confirms content alignment and reliably ranks systems.
Principles
- Metric evaluation requires explicit random-baseline floors.
- Contrastively trained encoders enhance validity but reduce discriminative power.
- Response length can confound metric discriminative power.
Method
RECOM is built by collecting 15,000 r/AskReddit questions from September 2025 and pairing each with depth-1 community replies. LLMs in the 7–10B parameter range are then evaluated against these replies using lexical, semantic, and inference-based metrics, alongside a random-derangement baseline.
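The random-derangement baseline can be illustrated with a minimal Python sketch. It is not the paper's code: the function names, the rejection-sampling derangement, the pooled-standard-deviation form of Cohen's d, and the toy Jaccard metric are all assumptions made for illustration. The idea is to score each model output against its own reference, then against a reference drawn from a different question, and report the effect size between the two score distributions.

```python
import random
import statistics


def random_derangement(n, rng):
    """Return a permutation of range(n) with no fixed points (rejection sampling)."""
    if n < 2:
        raise ValueError("a derangement needs at least two items")
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(i != p for i, p in enumerate(perm)):
            return perm


def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    if pooled == 0:
        raise ValueError("both samples have zero variance")
    return (statistics.mean(a) - statistics.mean(b)) / pooled


def validity_effect_size(metric, outputs, references, seed=0):
    """Effect size between matched scores and a deranged (mismatched) floor."""
    perm = random_derangement(len(outputs), random.Random(seed))
    matched = [metric(o, r) for o, r in zip(outputs, references)]
    floor = [metric(o, references[j]) for o, j in zip(outputs, perm)]
    return cohens_d(matched, floor)


def toy_jaccard(output, reference):
    """Hypothetical stand-in metric: Jaccard overlap of whitespace tokens."""
    a, b = set(output.split()), set(reference.split())
    return len(a & b) / len(a | b)
```

Calling `validity_effect_size(toy_jaccard, outputs, references)` on parallel lists of model outputs and community replies yields a single number; a large positive value means the metric separates true pairs from the random floor. Any real metric with the same two-argument signature can be swapped in for `toy_jaccard`.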
In practice
- Report metrics on both validity and discriminative power axes.
- Always include an explicit random-baseline floor for metrics.
- Control for response length when assessing inter-model discrimination.
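One way to check whether a between-system gap is a length artifact is sketched below in Python. The summary does not state how the paper controls for length, so the approach here is an assumption: fit a single linear regression of score on response length over both systems pooled, then compare the systems on the residuals. All function names are hypothetical.

```python
import statistics


def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    if pooled == 0:
        raise ValueError("both samples have zero variance")
    return (statistics.mean(a) - statistics.mean(b)) / pooled


def residualize_on_length(scores, lengths):
    """Remove the least-squares linear effect of length from the scores."""
    mean_x = statistics.mean(lengths)
    mean_y = statistics.mean(scores)
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, scores))
    slope = sxy / sxx if sxx else 0.0
    return [y - mean_y - slope * (x - mean_x) for x, y in zip(lengths, scores)]


def raw_and_length_controlled_d(scores_a, lengths_a, scores_b, lengths_b):
    """Return (raw d, d on length-residualized scores) for systems A and B."""
    raw = cohens_d(scores_a, scores_b)
    # Fit one regression on the pooled data so both systems share a slope.
    resid = residualize_on_length(scores_a + scores_b, lengths_a + lengths_b)
    resid_a, resid_b = resid[:len(scores_a)], resid[len(scores_a):]
    return raw, cohens_d(resid_a, resid_b)
```

If the raw effect size is large but the residualized one is near zero, the metric is mostly ranking systems by how much they write. A linear fit is the simplest possible control; length-matched subsampling or binning would be reasonable alternatives.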
Topics
- LLM Evaluation
- Automatic Metrics
- Validity-Discrimination Tradeoff
- Open-Ended Question Answering
- RECOM Dataset
- BERTScore
- Cosine Similarity
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.