RECOM: A Validity–Discrimination Tradeoff in Automatic Metrics for Open-Ended Reddit Question Answering

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, extended

Summary

The paper introduces RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free dataset of 15,000 r/AskReddit questions from September 2025, each paired with authentic community replies. Evaluating five open-source LLMs (7–10B parameters) against these replies, the authors identify a "validity–discrimination tradeoff" in automatic metrics for open-ended, opinion-driven question answering. A metric tends to do one of two jobs well. It either separates genuine content alignment from noise (validity; for example, cosine similarity with Cohen's d≈2) or reliably ranks systems (discriminative power; for example, BERTScore precision with raw |d| up to 0.63, which collapses to |d|=0.09 once length is controlled). No single metric does both. The tradeoff is reported as a property of metric design, not of the models, and is corroborated by three independent LLM judges.
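The effect sizes quoted above are Cohen's d values: the difference between two group means divided by their pooled standard deviation. The following is a minimal sketch of that standard formula using only the Python standard library. The function name and the toy numbers are illustrative assumptions, not taken from the paper, and the paper may use a different variant (for example, a paired or bias-corrected estimator).

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (sample variances)."""
    na, nb = len(group_a), len(group_b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two observations")
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Toy scores (hypothetical): metric values on matched pairs vs. shuffled pairs.
matched_scores = [0.62, 0.58, 0.71, 0.66, 0.60]
shuffled_scores = [0.31, 0.40, 0.35, 0.28, 0.37]
print(round(cohens_d(matched_scores, shuffled_scores), 2))
```

A large d between matched and shuffled pairs is what the summary calls validity; a large |d| between two systems' scores is what it calls discriminative power.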

Key takeaway

AI scientists and ML engineers evaluating LLMs on open-ended tasks should report several metrics rather than one. Do not rely on a single metric to both confirm content alignment and rank models, because the two goals trade off against each other. Report metrics on both axes, include a random-baseline floor explicitly, and control for response length when assessing discriminative power, so that verbosity is not mistaken for quality.
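One simple way to control for response length is to regress the metric on length across all systems and compare systems on the residuals. The sketch below assumes a linear relationship and uses hypothetical names and toy data. The summary does not say how the paper performs its length control, so treat this as one possible approach rather than the authors' procedure.

```python
from statistics import mean


def residualize_on_length(scores, lengths):
    """Remove the linear effect of response length from metric scores.

    Fits scores ~ intercept + slope * length by ordinary least squares
    and returns the residuals in the original order.
    """
    if len(scores) != len(lengths):
        raise ValueError("scores and lengths must have the same size")
    mean_len, mean_score = mean(lengths), mean(scores)
    sxx = sum((x - mean_len) ** 2 for x in lengths)
    if sxx == 0:
        # All responses have the same length: nothing to control for.
        return [y - mean_score for y in scores]
    slope = sum(
        (x - mean_len) * (y - mean_score) for x, y in zip(lengths, scores)
    ) / sxx
    intercept = mean_score - slope * mean_len
    return [y - (intercept + slope * x) for x, y in zip(lengths, scores)]


# Toy case (hypothetical): system B is simply longer, and the metric
# rewards length and nothing else.
lengths_a, lengths_b = [10, 20, 30], [40, 50, 60]
scores_a = [0.01 * n for n in lengths_a]
scores_b = [0.01 * n for n in lengths_b]

# Fit on the pooled data, then split the residuals back per system.
pooled = residualize_on_length(scores_a + scores_b, lengths_a + lengths_b)
resid_a, resid_b = pooled[:3], pooled[3:]

raw_gap = mean(scores_b) - mean(scores_a)
controlled_gap = mean(resid_b) - mean(resid_a)
print(raw_gap, controlled_gap)
```

In this toy case the raw gap between systems is entirely a length effect, so it vanishes after residualization. A metric whose system gap survives this step is discriminating on something other than verbosity.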

Key insights

Automatic metrics for open-ended QA exhibit a validity–discrimination tradeoff: no single metric both confirms content alignment and reliably ranks systems.

Method

RECOM consists of 15,000 r/AskReddit questions collected from September 2025, each paired with depth-1 community replies. The 7–10B LLMs are scored against these replies with lexical, semantic, and inference-based metrics, alongside a random-derangement baseline.
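A random-derangement baseline can be sketched as follows: shuffle the references so that no response is scored against the replies to its own question, then compare matched scores with these mismatched scores. The helper names and the toy Jaccard metric are assumptions for illustration. The paper's actual metrics and sampling procedure are not described in this summary.

```python
import random


def random_derangement(n, rng):
    """Return a permutation of range(n) with no fixed points.

    Uses rejection sampling: reshuffle until no index maps to itself.
    """
    if n < 2:
        raise ValueError("a derangement needs at least two items")
    indices = list(range(n))
    while True:
        rng.shuffle(indices)
        if all(i != j for i, j in enumerate(indices)):
            return list(indices)


def token_jaccard(text_a, text_b):
    """Toy lexical metric: Jaccard overlap of lowercased whitespace tokens."""
    tokens_a, tokens_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def matched_and_baseline_scores(responses, references, metric, seed=0):
    """Score each response against its own reference and a deranged one."""
    rng = random.Random(seed)
    perm = random_derangement(len(responses), rng)
    matched = [metric(r, ref) for r, ref in zip(responses, references)]
    baseline = [metric(r, references[j]) for r, j in zip(responses, perm)]
    return matched, baseline


# Hypothetical model responses and community replies.
responses = ["pizza with pineapple", "a quiet cabin", "learn to cook"]
references = ["pineapple pizza honestly", "cabin in the woods", "cook more at home"]
matched, baseline = matched_and_baseline_scores(responses, references, token_jaccard)
print(matched, baseline)
```

The baseline scores give the floor a metric reaches with no genuine content alignment. The gap between matched and baseline scores is what a validity effect size would be computed on.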

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.