Assessing LLM Reliability on Temporally Recent Open-Domain Questions

2026-02-13 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

A new study introduces RECOM (Reddit Evaluation for Correspondence of Models), a benchmark of 15,000 Reddit questions from September 2025 paired with community-derived reference answers, designed to assess Large Language Model (LLM) reliability on temporally recent, open-domain questions. The researchers evaluated four open-source LLMs (Llama-3.1-8B, Mistral-7B, Gemma-2-9B, and GPT-OSS-20B) with a multi-dimensional framework covering lexical metrics (BLEU, ROUGE), semantic similarity (BERTScore, MoverScore, cosine similarity), and logical inference (NLI). The core finding is a "semantic-lexical paradox": models achieve over 99% cosine similarity with the references despite less than 8% BLEU-1 overlap, which indicates extensive paraphrasing. Model scale does not predict performance: Mistral-7B (7B parameters) outperforms GPT-OSS-20B (20B parameters) across all metrics. Contradiction rates remained below 7%, suggesting the models rarely generate content that directly conflicts with the reference answers.
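
The gap between lexical and semantic scores can be illustrated with a toy paraphrase pair. The sketch below is not the study's evaluation code: it assumes whitespace tokenization for BLEU-1, and the two embedding vectors are hand-written placeholders standing in for the output of a sentence encoder. The function names bleu1 and cosine are hypothetical.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision multiplied by the BLEU brevity penalty."""
    cand = candidate.lower().split()  # assumption: plain whitespace tokenization
    ref = reference.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # A candidate token is credited at most as often as it occurs in the reference.
    matched = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    precision = matched / len(cand)
    # Brevity penalty is 1 when the candidate is at least as long as the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


reference = "restart the router and then update its firmware"
candidate = "power cycling the device before installing newer software usually helps"

# Hypothetical embeddings: placeholders, NOT produced by any real model.
ref_vec = [0.62, 0.31, 0.71, 0.12]
cand_vec = [0.60, 0.35, 0.70, 0.10]

lexical = bleu1(candidate, reference)
semantic = cosine(cand_vec, ref_vec)
print(f"BLEU-1: {lexical:.2f}  cosine: {semantic:.3f}")
```

Only one of the ten candidate tokens appears in the reference, so the lexical score stays low even though the two sentences give the same advice. In a real evaluation the vectors would come from an embedding model rather than being typed by hand.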

Key takeaway

If you are an AI or research scientist evaluating LLMs for open-domain question answering, adopt multi-dimensional evaluation frameworks that prioritize semantic fidelity over surface-level lexical overlap. Relying solely on metrics like BLEU or ROUGE can misrepresent model capabilities, because LLMs tend to paraphrase while preserving meaning. Do not base model selection purely on parameter count: smaller models like Mistral-7B can outperform larger ones on abstractive generation tasks, which suggests that architecture and training may matter more than scale.

Key insights

LLMs achieve high semantic alignment through paraphrasing, not lexical reproduction, challenging traditional evaluation metrics.

Principles

Model scale does not guarantee superior performance.
Lexical metrics underestimate abstractive generation quality.
Multi-dimensional evaluation captures nuanced LLM capabilities.

Method

The RECOM benchmark pairs 15,000 recent Reddit questions with reference answers that an LLM summarized from community responses. It scores model outputs with lexical, semantic, and NLI metrics to assess how closely they align with human perspectives.

In practice

Prioritize semantic metrics over lexical for abstractive tasks.
Consider smaller, well-tuned models like Mistral-7B.
Use NLI to flag LLM outputs that contradict the reference answers.
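
As a rough sketch of the NLI step, the snippet below turns per-answer NLI labels into rates and lists the answers flagged as contradictions. It assumes the labels were already produced by some NLI classifier applied to (reference, model answer) pairs, which is not shown here. The three label names, the function names, and the sample data are illustrative assumptions, not taken from the study.

```python
from collections import Counter

# Assumed label set; real NLI classifiers may name their classes differently.
LABELS = ("entailment", "neutral", "contradiction")


def nli_rates(labels: list[str]) -> dict[str, float]:
    """Share of each NLI label across all (reference, answer) pairs."""
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    if not labels:
        return {label: 0.0 for label in LABELS}
    counts = Counter(labels)
    return {label: counts[label] / len(labels) for label in LABELS}


def flag_contradictions(answers: list[str], labels: list[str]) -> list[str]:
    """Return the answers whose NLI label is 'contradiction'."""
    return [a for a, lab in zip(answers, labels) if lab == "contradiction"]


# Made-up sample data for illustration only.
answers = ["answer A", "answer B", "answer C", "answer D", "answer E"]
labels = ["neutral", "entailment", "neutral", "contradiction", "neutral"]

rates = nli_rates(labels)
flagged = flag_contradictions(answers, labels)
print(rates, flagged)
```

A contradiction rate computed this way measures disagreement with the reference answer, not factual correctness, so flagged answers still need a human or a second check to decide which side is right.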

Topics

LLM Evaluation
Semantic-Lexical Paradox
Open-Domain Question Answering
RECOM Dataset
Abstractive Generation

Best for: AI Scientist, Research Scientist, AI Researcher, NLP Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.