Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

MetaSyn is a dataset and benchmark for evaluating LLM agents on the multi-stage workflow of scientific meta-analysis. It comprises 442 expert-curated meta-analyses from Nature Portfolio journals and includes PI/ECO criteria, a 140,585-article PubMed corpus, verified positive studies, and hard negatives. Benchmarks of twelve pipeline configurations, including nine RAG variants and a protocol-driven agent, revealed a screening bottleneck: although retrieval reached a ceiling of 90.9% recall at K=200, no system recovered more than 52.7% of the ground-truth included literature. Current LLMs therefore struggle to separate eligible studies from topically similar but PI/ECO-ineligible distractors, which motivates stage-attributed metrics that show where in the pipeline a failure occurs.

Key takeaway

For AI scientists and machine learning engineers building scientific evidence synthesis tools: the primary bottleneck for LLM agents in meta-analysis is not retrieval but robust criterion-based screening. Prioritize explicit PI/ECO-gated screening components, and fine-tune retrievers on eligibility-labeled data, so that systems reliably distinguish eligible studies from topically similar distractors. Both steps target inclusion recall and trustworthiness.

Key insights

LLM agents critically fail at criterion-based screening in scientific meta-analysis, despite high retrieval recall.

Principles

Meta-analysis demands strict PI/ECO protocol adherence.
Topical relevance alone is insufficient for study screening.
Stage-attributed metrics are crucial for diagnosing system failures.
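
To make the idea of stage-attributed metrics concrete, here is a minimal sketch that splits end-to-end inclusion recall into a retrieval part and a screening part. The function name, the field names, and the toy identifiers are assumptions made for illustration; they are not taken from the MetaSyn code or paper, whose exact metric definitions may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class StageRecall:
    retrieval: float   # share of ground-truth studies present in the retrieved set
    screening: float   # share of those retrieved ground-truth studies kept after screening
    end_to_end: float  # share of ground-truth studies in the final included set


def stage_recall(ground_truth: Iterable[str],
                 retrieved: Iterable[str],
                 included: Iterable[str]) -> StageRecall:
    """Attribute lost ground-truth studies to retrieval or to screening.

    Hypothetical helper: inputs are collections of study identifiers.
    """
    gt = set(ground_truth)
    if not gt:
        raise ValueError("ground_truth must not be empty")
    found = gt & set(retrieved)    # eligible studies that survived retrieval
    kept = found & set(included)   # eligible studies that also survived screening
    retrieval = len(found) / len(gt)
    screening = len(kept) / len(found) if found else 0.0
    end_to_end = len(kept) / len(gt)
    return StageRecall(retrieval, screening, end_to_end)


# Toy data, not benchmark results: 10 eligible studies, 9 retrieved, 5 kept.
gt = [f"pmid{i}" for i in range(10)]
retrieved = gt[:9] + ["distractor1", "distractor2"]
included = gt[:5] + ["distractor1"]
report = stage_recall(gt, retrieved, included)
print(report)
```

Under these definitions, end-to-end recall is the product of retrieval recall and screening recall, so a large gap between the first and the last points at the screening stage rather than the retriever.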

Method

MetaSyn benchmarks LLM agents on end-to-end meta-analysis generation and isolated retrieval, using expert-curated stage-level ground truth.

In practice

Train retrievers on eligibility-labeled study pairs.
Implement explicit PI/ECO-gated screening components.
Develop synthesis metrics that account for certainty.
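
One way to read "explicit PI/ECO-gated screening" is as a gate that admits a study only when every protocol criterion is judged satisfied, so that topical similarity alone never suffices. The sketch below assumes four criterion fields (population, intervention or exposure, comparator, outcome) and a pluggable judge function. Both are illustrative assumptions, not the schema or the agent described in the source, and the keyword judge is a toy stand-in for an LLM or classifier call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumed criterion fields; the source does not spell out the PI/ECO schema.
CRITERIA = ("population", "intervention_or_exposure", "comparator", "outcome")

# A judge decides whether one criterion is met: (criterion text, abstract) -> bool.
Judge = Callable[[str, str], bool]


@dataclass
class ScreeningDecision:
    study_id: str
    include: bool
    failed: List[str]  # names of the criteria the study did not satisfy


def screen(study_id: str, abstract: str,
           protocol: Dict[str, str], judge: Judge) -> ScreeningDecision:
    """Include a study only if every protocol criterion passes the judge."""
    failed = [name for name in CRITERIA if not judge(protocol[name], abstract)]
    return ScreeningDecision(study_id, include=not failed, failed=failed)


def keyword_judge(criterion: str, abstract: str) -> bool:
    # Toy stand-in: a real system would call an LLM or a trained classifier here.
    return criterion.lower() in abstract.lower()


# Hypothetical protocol and abstracts, invented for this example.
protocol = {
    "population": "adults",
    "intervention_or_exposure": "statin",
    "comparator": "placebo",
    "outcome": "mortality",
}
eligible = screen("s1", "Adults received a statin or placebo; the outcome was "
                        "all-cause mortality.", protocol, keyword_judge)
distractor = screen("s2", "Adults received a statin; LDL cholesterol was measured.",
                    protocol, keyword_judge)
print(eligible)
print(distractor)
```

Recording which criteria failed, rather than only a yes/no decision, keeps each exclusion auditable against the protocol.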

Topics

LLM Agents
Meta-Analysis
Scientific Benchmarking
Retrieval-Augmented Generation
PI/ECO Criteria
Evidence Synthesis
Information Retrieval

Code references

BFTree/MetaSyn

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.