Assessing LLM Reliability on Temporally Recent Open-Domain Questions
Summary
A new study introduces RECOM (Reddit Evaluation for Correspondence of Models), a benchmark of 15,000 Reddit questions posted in September 2025, each paired with a community-derived reference answer, to assess Large Language Model (LLM) reliability on temporally recent, open-domain questions. The researchers evaluated four open-source LLMs (Llama-3.1-8B, Mistral-7B, Gemma-2-9B, and GPT-OSS-20B) with a multi-dimensional framework covering lexical metrics (BLEU, ROUGE), semantic similarity (BERTScore, MoverScore, cosine similarity), and logical inference (NLI). The core finding is a "semantic-lexical paradox": models achieve over 99% cosine similarity with the references despite less than 8% BLEU-1 overlap, which indicates extensive paraphrasing. Model scale does not predict performance: Mistral-7B outperforms the larger GPT-OSS-20B across all metrics. Contradiction rates remained below 7%, suggesting that models rarely generate content that directly conflicts with the reference answers.
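The gap between lexical and semantic scores can be illustrated at toy scale. The Python sketch below computes clipped unigram precision (the core of BLEU-1, with the brevity penalty omitted) for a paraphrase, and cosine similarity between two vectors. The function names and example sentences are illustrative assumptions, and the embedding vectors are hand-written placeholders, not outputs of any encoder used in the study.

```python
import math
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: BLEU-1 without the brevity penalty."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate token is credited at most as often as it occurs in the reference.
    matched = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return matched / len(cand_tokens)


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical reference answer and a paraphrase of it.
reference = "restart the router and check the cable"
paraphrase = "try rebooting the modem then inspect wiring"
overlap = unigram_precision(paraphrase, reference)

# Placeholder vectors standing in for sentence embeddings (made up for illustration).
reference_vec = [0.9, 0.1, 0.4]
paraphrase_vec = [0.8, 0.2, 0.5]
similarity = cosine_similarity(reference_vec, paraphrase_vec)

print(f"unigram precision: {overlap:.3f}")
print(f"cosine similarity: {similarity:.3f}")
```

Only one of the paraphrase's seven tokens appears in the reference, so the lexical score is low even though the two sentences give the same advice. A real evaluation would replace the placeholder vectors with embeddings from a sentence encoder.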
Key takeaway
AI scientists and research scientists evaluating LLMs for open-domain question answering should adopt multi-dimensional evaluation frameworks that prioritize semantic fidelity over surface-level lexical overlap. Relying solely on metrics like BLEU or ROUGE can misrepresent model capabilities, because LLMs tend to paraphrase while preserving meaning. Model selection should not rest on parameter count alone: smaller models such as Mistral-7B can outperform larger ones on abstractive generation tasks, which suggests that architecture and training choices matter more than raw size for this kind of task.
Key insights
LLMs achieve high semantic alignment through paraphrasing rather than lexical reproduction, which challenges the adequacy of traditional overlap-based evaluation metrics.
Principles
- Model scale does not guarantee superior performance.
- Lexical metrics underestimate abstractive generation quality.
- Multi-dimensional evaluation captures nuanced LLM capabilities.
Method
The RECOM benchmark pairs 15,000 recent Reddit questions with reference answers produced by summarizing community responses with an LLM. Model outputs are scored against these references using lexical, semantic, and NLI metrics to assess how closely they align with human perspectives.
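A multi-metric evaluation of this kind can be sketched as a small harness that applies several scoring functions to the same (model answer, reference answer) pairs and averages each one. The Python sketch below is a minimal illustration under stated assumptions: `evaluate`, `rouge_l_f1`, and the sample pairs are hypothetical names and data, and the LCS-based ROUGE-L here is a simplified stand-in, not the implementation used in the study. Semantic and NLI scorers would be registered in the same `metrics` dictionary.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# A metric maps (candidate, reference) to a score in [0, 1].
Metric = Callable[[str, str], float]


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 over LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs: List[Tuple[str, str]], metrics: Dict[str, Metric]) -> Dict[str, float]:
    """Average every registered metric over all (candidate, reference) pairs."""
    if not pairs:
        raise ValueError("need at least one (candidate, reference) pair")
    return {name: mean(fn(c, r) for c, r in pairs) for name, fn in metrics.items()}


# Hypothetical model answers paired with reference answers.
pairs = [
    ("the cat sat", "the cat sat down"),
    ("reboot your modem", "restart the router"),
]
report = evaluate(pairs, {"rouge_l_f1": rouge_l_f1})
print(report)
```

Keeping the metrics behind one callable signature makes it straightforward to report lexical, semantic, and inference scores side by side for each model.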
In practice
- Prioritize semantic metrics over lexical for abstractive tasks.
- Consider smaller, well-tuned models like Mistral-7B.
- Use NLI to detect factual inconsistencies in LLM outputs.
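The NLI recommendation above can be turned into a contradiction-rate check. The Python sketch below assumes a classifier callable that takes a premise and a hypothesis and returns one of three labels. The name `nli_label_rates`, the choice of the reference answer as premise and the model answer as hypothesis, and the sample data are all assumptions made for illustration. The `canned_classifier` is a lookup stub that stands in for a real NLI model and performs no inference.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

LABELS = ("entailment", "neutral", "contradiction")

# (premise, hypothesis) -> one of LABELS
NliClassifier = Callable[[str, str], str]


def nli_label_rates(pairs: List[Tuple[str, str]], classify: NliClassifier) -> Dict[str, float]:
    """Fraction of (reference, answer) pairs assigned to each NLI label."""
    counts: Counter = Counter()
    for reference, answer in pairs:
        label = classify(reference, answer)
        if label not in LABELS:
            raise ValueError(f"unexpected NLI label: {label!r}")
        counts[label] += 1
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


# Hypothetical (reference answer, model answer) pairs with hand-assigned labels.
CANNED = {
    ("The update is free", "The update costs nothing"): "entailment",
    ("The update is free", "You have to pay for the update"): "contradiction",
    ("The update is free", "It shipped in September"): "neutral",
    ("The port is USB-C", "It uses a USB-C connector"): "entailment",
}


def canned_classifier(premise: str, hypothesis: str) -> str:
    """Stub: looks up a hand-assigned label instead of running a model."""
    return CANNED[(premise, hypothesis)]


rates = nli_label_rates(list(CANNED.keys()), canned_classifier)
print(f"contradiction rate: {rates['contradiction']:.2%}")
```

In practice, `canned_classifier` would be replaced by a wrapper around a trained NLI model, and answers labeled as contradictions would be flagged for review.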
Topics
- LLM Evaluation
- Semantic-Lexical Paradox
- Open-Domain Question Answering
- RECOM Dataset
- Abstractive Generation
Best for: AI Scientist, Research Scientist, AI Researcher, NLP Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.