Understanding LLMs in Title-Abstract Screening: From Disagreements to Recommendations
Summary
A study qualitatively investigated how large language models (LLMs) fail in title-abstract screening for systematic reviews (SRs), analyzing disagreements with human experts across six software engineering SRs and over 1,000 primary study papers. The research found Cohen's Kappa values between human consensus and LLMs ranged from 0.52 to 0.77. Qualitative analysis revealed seven recurring disagreement patterns, including "Term - Boundary," "Abstract - Information Omission," "LLM - Keyword Overweight," "LLM - Not main focus," "LLM - Incorrect Topic Inference," "Human - Error," and "Operationalization - Criteria combination." These patterns often stemmed from issues like boundary ambiguity in key terms, LLM over-reliance on keywords, or incorrect topic inference. The study utilized models such as gemini-2.5-flash and openai/gpt-4.1-mini, and in one instance, anthropic/claude-haiku-4.5.
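For readers less familiar with the agreement statistic quoted above, the following is a minimal sketch of how Cohen's kappa can be computed for two sets of include/exclude decisions. The function name `cohens_kappa` and the toy labels are illustrative assumptions and are not taken from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters' labels over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy decisions (not data from the study).
human = ["include", "exclude", "exclude", "include",
         "exclude", "exclude", "include", "exclude"]
llm = ["include", "exclude", "include", "include",
       "exclude", "exclude", "exclude", "exclude"]
kappa = cohens_kappa(human, llm)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters in screening because most papers are typically excluded and two raters can agree often simply by excluding nearly everything.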
Key takeaway
Research scientists integrating LLMs into systematic review workflows should anticipate specific failure modes such as boundary ambiguity and keyword overweight. To improve reliability, define inclusion and exclusion criteria unambiguously, run multiple LLMs, and evaluate each criterion separately, combining the results with programmatic Boolean logic. Focus validation effort on borderline cases and on papers where the LLMs disagree, since these are the cases most likely to show how the screening process should be refined and where relevant evidence could be lost.
Key insights
LLMs fail in systematic review screening in predictable ways that trace back to semantic and lexical issues, which calls for targeted mitigation strategies.
Principles
- LLM screening disagreements stem from identifiable lexical and semantic issues.
- Different LLMs interpret the same criteria differently, so comparing their decisions helps detect errors.
- Unambiguous criteria definitions are crucial for consistent LLM application.
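One way to act on the second principle is to collect each model's decision per paper and route any paper without a unanimous decision to a human reviewer. The sketch below assumes the decisions are already available as plain dictionaries; the model names, paper IDs, and the helper `flag_for_review` are hypothetical.

```python
def flag_for_review(decisions_by_model):
    """Return the paper IDs on which the models do not all agree.

    decisions_by_model maps a model name to {paper_id: "include" | "exclude"}.
    Only papers that every model has decided on are compared.
    """
    if not decisions_by_model:
        return []
    per_model = list(decisions_by_model.values())
    shared_ids = set.intersection(*(set(d) for d in per_model))
    flagged = []
    for paper_id in sorted(shared_ids):
        votes = {d[paper_id] for d in per_model}
        if len(votes) > 1:  # at least two models gave different decisions
            flagged.append(paper_id)
    return flagged


# Hypothetical decisions from three models.
decisions = {
    "model_a": {"P1": "include", "P2": "exclude", "P3": "include"},
    "model_b": {"P1": "include", "P2": "include", "P3": "include"},
    "model_c": {"P1": "include", "P2": "exclude", "P3": "exclude"},
}
needs_human_check = flag_for_review(decisions)
```

A unanimity rule is only one possible design choice. It is the most conservative option, since it sends the largest number of papers to manual checking.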
Method
The study used a qualitative cross-study design, analyzing disagreements between human experts and LLMs in zero-shot mode across six software engineering SRs. Divergent decisions were inductively coded to identify recurring patterns.
In practice
- Run multiple LLMs for diverse interpretations.
- Define criteria unambiguously, stating the boundaries of key terms and naming likely confounders.
- Evaluate each screening criterion separately using Boolean logic.
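The last recommendation can be sketched as follows: ask the model one narrowly scoped yes/no question per criterion, then combine the answers in ordinary code instead of asking the model for a single overall verdict. The criterion names and the combination rule below are hypothetical examples and do not come from any of the six reviews.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionVerdicts:
    """Per-criterion answers for one paper, each obtained separately.

    The three criteria are hypothetical placeholders.
    """
    is_software_engineering: bool   # inclusion criterion 1
    reports_empirical_study: bool   # inclusion criterion 2
    is_secondary_study: bool        # exclusion criterion


def screen(verdicts):
    """Combine per-criterion answers with explicit Boolean logic.

    Assumed rule: include only if all inclusion criteria hold
    and no exclusion criterion holds.
    """
    meets_inclusion = (verdicts.is_software_engineering
                       and verdicts.reports_empirical_study)
    hits_exclusion = verdicts.is_secondary_study
    return meets_inclusion and not hits_exclusion


paper = CriterionVerdicts(
    is_software_engineering=True,
    reports_empirical_study=True,
    is_secondary_study=False,
)
decision = "include" if screen(paper) else "exclude"
```

Keeping the combination rule in code makes it auditable: when a decision differs from the human consensus, the individual criterion that caused it can be inspected directly.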
Topics
- Large Language Models
- Systematic Reviews
- Title-Abstract Screening
- LLM Reliability
- Qualitative Analysis
- Software Engineering Research
Best for: Research Scientist, AI Scientist, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.