Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

2026-06-15 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

MetaSyn is a new dataset for benchmarking LLM agents on meta-analysis, a demanding form of evidence synthesis. It comprises 442 expert-curated meta-analyses from Nature Portfolio journals, each featuring a research question, PI/ECO criteria, a 140k-article PubMed retrieval corpus, verified positive studies, and hard negatives. The dataset aims to provide ground truth across the full retrieval-screening-synthesis pipeline, which existing benchmarks lack. A benchmark of twelve pipeline configurations, including nine RAG variants and a protocol-driven agent, revealed a critical screening bottleneck: although retrieval reaches a ceiling of 90.9% recall at K=200, no system recovered more than 52.7% of the ground-truth included literature. Current LLMs cannot reliably distinguish eligible studies from topically similar distractors that fail the PI/ECO criteria. The study argues that stage-attributed metrics, rather than a single end-to-end score, are needed to understand system performance.
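To make the gap between the two headline figures concrete, here is a back-of-the-envelope sketch. It assumes end-to-end recall factors as retrieval recall times screening recall, which holds exactly only for a single meta-analysis or for pooled counts, and it assumes the best system retrieved at the reported ceiling; the summary does not confirm either. The function name is hypothetical.

```python
# Figures quoted in the summary above.
RETRIEVAL_CEILING = 0.909   # recall at K=200
BEST_END_TO_END = 0.527     # best share of included studies recovered


def implied_screening_recall(end_to_end: float, retrieval_recall: float) -> float:
    """Screening recall implied by end_to_end = retrieval_recall * screening_recall.

    Assumption: the two stages compose multiplicatively, which is exact for
    pooled counts but not for per-question averages.
    """
    if not 0.0 < retrieval_recall <= 1.0:
        raise ValueError("retrieval_recall must be in (0, 1]")
    return end_to_end / retrieval_recall


# About 0.58 if the best system retrieved at the ceiling; if it retrieved
# less, more of the loss belongs to retrieval and less to screening.
estimate = implied_screening_recall(BEST_END_TO_END, RETRIEVAL_CEILING)
print(f"implied screening recall: {estimate:.3f}")
```

The ambiguity in the last comment is the practical argument for reporting each stage separately: the two headline numbers alone cannot say where the loss occurred.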

Key takeaway

If you are an NLP engineer building LLM agents for scientific literature synthesis, expect the screening phase to be the main bottleneck. Even with strong retrieval, your agents will likely struggle to separate eligible studies from topically similar distractors that fail the PI/ECO criteria. To raise accuracy on meta-analysis tasks, prioritize work on applying PI/ECO criteria correctly during screening rather than optimizing retrieval alone.

Key insights

LLM agents face a critical bottleneck when screening studies for meta-analysis: they cannot reliably distinguish eligible studies from topically similar ones that fail the PI/ECO criteria.

Principles

Meta-analysis tests systematic reasoning across retrieval, screening, and synthesis.
Stage-attributed metrics show where a pipeline fails; a single end-to-end score does not.
LLMs struggle with nuanced PI/ECO criteria.

Method

The study pairs research questions with PI/ECO criteria, a 140k-article PubMed corpus, and expert-curated ground truth, then evaluates retrieval and screening performance across twelve pipeline configurations.

In practice

Prioritize screening accuracy over further retrieval gains.
Improve adherence to PI/ECO criteria during screening.
Implement stage-attributed performance metrics.
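As an illustration of what stage-attributed metrics might look like, the sketch below scores one meta-analysis from three sets of study IDs. This is not the benchmark's own scoring code; the names StageReport and stage_report, the metric definitions, and the toy IDs are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class StageReport:
    retrieval_recall: float     # included studies present in the top-K
    screening_recall: float     # retrieved included studies the screener kept
    screening_precision: float  # kept studies that are truly included
    end_to_end_recall: float    # included studies surviving both stages


def stage_report(
    included: Iterable[str],
    retrieved_top_k: Iterable[str],
    screened_in: Iterable[str],
) -> StageReport:
    """Attribute recall loss to retrieval or screening for one meta-analysis."""
    included_set = set(included)
    if not included_set:
        raise ValueError("ground truth must contain at least one included study")
    retrieved = set(retrieved_top_k)
    # The screener can only keep studies that retrieval surfaced.
    kept = set(screened_in) & retrieved

    found = included_set & retrieved
    kept_relevant = included_set & kept

    return StageReport(
        retrieval_recall=len(found) / len(included_set),
        screening_recall=len(kept_relevant) / len(found) if found else 0.0,
        screening_precision=len(kept_relevant) / len(kept) if kept else 0.0,
        end_to_end_recall=len(kept_relevant) / len(included_set),
    )


# Toy example with made-up IDs: retrieval finds 3 of 4 included studies,
# but screening keeps only 1 of those 3 and also admits one distractor.
report = stage_report(
    included=["a", "b", "c", "d"],
    retrieved_top_k=["a", "b", "c", "x", "y"],
    screened_in=["a", "x"],
)
print(report)
```

In the toy run, the end-to-end recall alone looks like a weak pipeline overall, while the per-stage view shows retrieval doing most of its job and screening discarding most of what it was handed.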

Topics

LLM Agents
Meta-analysis
Benchmarking
Information Retrieval
PI/ECO Criteria
Retrieval-Augmented Generation

Best for: AI Scientist, Research Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.