Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio
Summary
MetaSyn is a dataset and benchmark for evaluating LLM agents on the multi-stage workflow of scientific meta-analysis. Built from 442 expert-curated meta-analyses from Nature Portfolio journals, it includes PI/ECO criteria, a 140,585-article PubMed corpus, verified positive studies, and hard negatives. Benchmarking twelve pipeline configurations, including nine RAG variants and a protocol-driven agent, revealed a screening bottleneck: although retrieval reached a ceiling of 90.9% recall at K = 200, no system recovered more than 52.7% of the ground-truth included literature. Current LLMs therefore struggle to reliably separate eligible studies from topically similar but PI/ECO-ineligible distractors, which highlights the need for stage-attributed metrics that diagnose specific failure points.
Key takeaway
For AI scientists and machine learning engineers building scientific evidence synthesis tools: the primary bottleneck for LLM agents in meta-analysis is not retrieval but robust criterion-based screening. Prioritize explicit PI/ECO-gated screening components and fine-tune retrievers on eligibility-labeled data, so that systems reliably distinguish eligible studies from topically similar distractors and improve both inclusion recall and trustworthiness.
Key insights
LLM agents critically fail at criterion-based screening in scientific meta-analysis, despite high retrieval recall.
Principles
- Meta-analysis demands strict PI/ECO protocol adherence.
- Topical relevance alone is insufficient for study screening.
- Stage-attributed metrics are crucial for diagnosing system failures.
Method
MetaSyn benchmarks LLM agents on end-to-end meta-analysis generation and isolated retrieval, using expert-curated stage-level ground truth.
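To make the idea of stage-level evaluation concrete, here is a minimal sketch of how recall could be attributed to the retrieval and screening stages separately. It is not the benchmark's own scoring code: the function name `stage_attributed_recall`, its arguments, and the toy study IDs are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Sequence, Set


def stage_attributed_recall(
    ranked_ids: Sequence[str],   # retriever output, best first (hypothetical IDs)
    screened_in: Iterable[str],  # IDs the screening stage chose to include
    relevant: Set[str],          # ground-truth included studies
    k: int,
) -> Dict[str, float]:
    """Split end-to-end inclusion recall into a retrieval part and a screening part."""
    retrieved = set(ranked_ids[:k])
    found = retrieved & relevant                    # eligible studies that survived retrieval
    kept = set(screened_in) & retrieved & relevant  # eligible studies that survived screening

    retrieval_recall = len(found) / len(relevant) if relevant else 0.0
    # Conditional on retrieval: of the eligible studies retrieved, how many were kept?
    screening_recall = len(kept) / len(found) if found else 0.0
    inclusion_recall = len(kept) / len(relevant) if relevant else 0.0
    return {
        "retrieval_recall": retrieval_recall,
        "screening_recall": screening_recall,
        "inclusion_recall": inclusion_recall,
    }


# Toy example: 4 eligible studies, 3 of them retrieved in the top 5, 1 kept by screening.
scores = stage_attributed_recall(
    ranked_ids=["a", "x", "b", "y", "c", "z"],
    screened_in=["a", "x"],
    relevant={"a", "b", "c", "d"},
    k=5,
)
```

In this decomposition, inclusion recall equals retrieval recall multiplied by screening recall, so a low end-to-end score paired with a high retrieval score points at the screening stage.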
In practice
- Train retrievers on eligibility-labeled study pairs.
- Implement explicit PI/ECO-gated screening components.
- Develop synthesis metrics that account for certainty.
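As an illustration of the second recommendation, the sketch below shows one possible shape for a PI/ECO-gated screening component: a study is included only if every criterion is judged satisfied, and the per-criterion verdicts are kept for auditing. It assumes PI/ECO reads as population, intervention or exposure, comparator, and outcome. The `Protocol` fields, the `Judge` signature, and the `keyword_judge` stand-in (a real system would presumably call an LLM or a trained classifier here) are hypothetical and not taken from the paper.

```python
from dataclasses import asdict, dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Protocol:
    """Hypothetical container for one review's eligibility criteria."""
    population: str
    intervention_or_exposure: str
    comparator: str
    outcome: str


# (criterion name, criterion text, abstract) -> is the criterion satisfied?
Judge = Callable[[str, str, str], bool]


def keyword_judge(name: str, criterion: str, abstract: str) -> bool:
    """Toy stand-in for a model-based judge: case-insensitive substring match."""
    return criterion.lower() in abstract.lower()


def screen(abstract: str, protocol: Protocol, judge: Judge) -> Tuple[bool, Dict[str, bool]]:
    """Gate inclusion on every criterion; return the decision and per-criterion verdicts."""
    verdicts = {
        name: judge(name, text, abstract) for name, text in asdict(protocol).items()
    }
    return all(verdicts.values()), verdicts


protocol = Protocol("adults", "statin", "placebo", "mortality")

eligible_abstract = (
    "Randomized trial in adults comparing statin with placebo; "
    "primary endpoint was all-cause mortality."
)
# Topically similar, but no comparator arm is described.
distractor_abstract = "Statin use in adults and mortality: a narrative overview."

included, _ = screen(eligible_abstract, protocol, keyword_judge)
rejected, why = screen(distractor_abstract, protocol, keyword_judge)
```

Here the distractor matches the topic on three of four criteria and is still excluded, and the verdict dictionary records which criterion failed.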
Topics
- LLM Agents
- Meta-Analysis
- Scientific Benchmarking
- Retrieval-Augmented Generation
- PI/ECO Criteria
- Evidence Synthesis
- Information Retrieval
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.