Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio
Summary
A new dataset, MetaSyn, benchmarks LLM agents on meta-analysis, a complex form of evidence synthesis. MetaSyn comprises 442 expert-curated meta-analyses from Nature Portfolio journals. Each one pairs a research question and PI/ECO criteria with a retrieval corpus of 140k PubMed articles, verified positive studies, and hard negatives. The dataset aims to provide ground truth across the full retrieval-screening-synthesis pipeline, which existing benchmarks lack. Benchmarking twelve pipeline configurations, including nine RAG variants and a protocol-driven agent, revealed a critical screening bottleneck: although retrieval reached a ceiling of 90.9% recall at K=200, no system recovered more than 52.7% of the ground-truth included literature. Current LLMs cannot reliably distinguish eligible studies from topically similar distractors that fail the PI/ECO criteria. The study argues that stage-attributed metrics, rather than a single end-to-end score, are needed to understand where a system fails.
Key takeaway
NLP engineers building LLM agents for scientific literature synthesis should expect the screening phase, not retrieval, to be the main bottleneck. Even with strong retrieval, agents will likely struggle to separate eligible studies from topically similar distractors that fail the PI/ECO criteria. To raise accuracy on meta-analysis tasks, prioritize work on applying PI/ECO criteria correctly during screening rather than optimizing retrieval alone.
Key insights
LLM agents face a critical bottleneck when screening studies for meta-analysis: they fail to reliably distinguish eligible studies from topically similar ones that do not meet the PI/ECO criteria.
Principles
- Meta-analysis tests systematic reasoning across retrieval, screening, and synthesis.
- Stage-attributed metrics reveal system failures.
- LLMs struggle with nuanced PI/ECO criteria.
Method
The study benchmarks LLM agents by pairing each research question with PI/ECO criteria, a 140k-article PubMed corpus, and expert-curated ground truth, then evaluating retrieval and screening performance across twelve pipeline configurations.
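To make the pairing concrete, one benchmark item could be represented roughly as below. This is a minimal sketch, not the dataset's actual schema: the class name, field names, the four criteria keys, and the example IDs are all hypothetical, chosen only to mirror the components listed above.

```python
from dataclasses import dataclass, field


@dataclass
class MetaAnalysisInstance:
    """One benchmark item. All field names here are hypothetical."""

    question: str
    criteria: dict[str, str]  # assumed PI/ECO element -> eligibility text
    positives: set[str]  # IDs of verified included studies
    hard_negatives: set[str] = field(default_factory=set)  # topical but ineligible

    def label(self, study_id: str) -> str:
        """Ground-truth label for a candidate drawn from the retrieval corpus."""
        if study_id in self.positives:
            return "include"
        if study_id in self.hard_negatives:
            return "hard_negative"
        return "background"


# Made-up example; the question, criteria, and IDs are illustrative only.
example = MetaAnalysisInstance(
    question="Does intervention X reduce outcome Y in adults?",
    criteria={
        "population": "adults",
        "intervention_or_exposure": "X",
        "comparison": "placebo",
        "outcome": "Y",
    },
    positives={"pmid:1", "pmid:2"},
    hard_negatives={"pmid:3"},
)
```

Keeping hard negatives separate from the rest of the corpus lets an evaluation report how often a screener accepts a topically similar but ineligible study, apart from how often it accepts unrelated background articles.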
In practice
- Prioritize screening accuracy over further retrieval gains.
- Improve adherence to PI/ECO criteria during screening.
- Report stage-attributed performance metrics, not only an end-to-end score.
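Stage-attributed reporting can be sketched as follows. This is one plausible decomposition under stated assumptions, not the study's exact metric definitions: the function name `stage_metrics`, the metric names, and the toy IDs are hypothetical. The idea is to measure retrieval recall, then screening recall only over the relevant studies that retrieval actually surfaced, so a low end-to-end score can be traced to the stage that caused it.

```python
def stage_metrics(relevant, retrieved, screened_in):
    """Split end-to-end recall into retrieval and screening components.

    relevant:    IDs of ground-truth included studies
    retrieved:   IDs returned by the retriever (for example, the top K)
    screened_in: IDs the screener decided to include
    """
    relevant, retrieved = set(relevant), set(retrieved)
    kept = set(screened_in) & retrieved  # screening can only keep retrieved items
    found = relevant & retrieved  # relevant studies that survived retrieval
    true_kept = relevant & kept  # relevant studies that survived both stages
    return {
        "retrieval_recall": len(found) / len(relevant) if relevant else 0.0,
        "screening_recall": len(true_kept) / len(found) if found else 0.0,
        "screening_precision": len(true_kept) / len(kept) if kept else 0.0,
        "end_to_end_recall": len(true_kept) / len(relevant) if relevant else 0.0,
    }


# Toy data, not results from the study.
toy = stage_metrics(
    relevant={"a", "b", "c", "d"},
    retrieved={"a", "b", "c", "x", "y"},
    screened_in={"a", "x"},
)
```

In this toy case retrieval finds three of four relevant studies, but screening keeps only one of those three, so the end-to-end recall of 0.25 is mostly a screening loss. Under these definitions, end-to-end recall equals retrieval recall multiplied by screening recall, which is what makes the attribution possible.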
Topics
- LLM Agents
- Meta-analysis
- Benchmarking
- Information Retrieval
- PI/ECO Criteria
- Retrieval-Augmented Generation
Best for: AI Scientist, Research Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.