Understanding LLMs in Title-Abstract Screening: From Disagreements to Recommendations

2026-06-17 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, extended

Summary

A qualitative study investigated how large language models (LLMs) fail in title-abstract screening for systematic reviews (SRs) by analyzing their disagreements with human experts across six software engineering SRs and more than 1,000 primary studies. Cohen's Kappa between the human consensus and the LLMs ranged from 0.52 to 0.77. The analysis identified seven recurring disagreement patterns: "Term - Boundary," "Abstract - Information Omission," "LLM - Keyword Overweight," "LLM - Not main focus," "LLM - Incorrect Topic Inference," "Human - Error," and "Operationalization - Criteria combination." These patterns often stemmed from boundary ambiguity in key terms, LLM over-reliance on keywords, or incorrect topic inference. The study used models such as gemini-2.5-flash and openai/gpt-4.1-mini and, in one instance, anthropic/claude-haiku-4.5.
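For readers less familiar with the agreement statistic quoted above, the following is a minimal sketch of how Cohen's Kappa can be computed for two sets of include/exclude decisions. The function name `cohens_kappa` and the six sample decisions are illustrative assumptions, not data or code from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters need the same non-zero number of decisions")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical decisions for six papers (not taken from the study).
human = ["include", "exclude", "exclude", "include", "exclude", "exclude"]
llm = ["include", "exclude", "include", "include", "exclude", "exclude"]

kappa = cohens_kappa(human, llm)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.67"
```

In this toy example the raters agree on five of six papers (observed agreement 0.83) while chance agreement is 0.50, which gives a Kappa of about 0.67, inside the 0.52 to 0.77 range the study reports.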

Key takeaway

Research scientists integrating LLMs into systematic review workflows should anticipate specific failure modes such as boundary ambiguity and keyword overweight. To improve reliability, define inclusion and exclusion criteria unambiguously, run multiple LLMs, and evaluate each criterion separately, combining the per-criterion results with programmatic Boolean logic. Focus validation effort on borderline cases and on papers where the LLMs disagree, since these show where the screening process needs refinement and where relevant evidence could be lost.
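One way to act on the advice about disagreement cases is to collect every paper on which the models do not vote unanimously and route only those to a human reviewer. The sketch below assumes decisions are already available as booleans per paper; the function name `flag_for_review`, the paper identifiers, and the sample votes are hypothetical, and only the model names come from the summary above.

```python
def flag_for_review(decisions_by_model):
    """Return sorted IDs of papers on which the models do not all agree.

    decisions_by_model maps a model name to a dict of paper ID -> bool,
    where True means "include". Only papers judged by every model are compared.
    """
    if not decisions_by_model:
        return []
    shared_ids = set.intersection(*(set(d) for d in decisions_by_model.values()))
    flagged = []
    for paper_id in sorted(shared_ids):
        votes = {decisions[paper_id] for decisions in decisions_by_model.values()}
        if len(votes) > 1:  # at least one model voted differently
            flagged.append(paper_id)
    return flagged


# Hypothetical decisions; paper IDs and votes are made up for illustration.
decisions = {
    "gemini-2.5-flash": {"P1": True, "P2": True, "P3": False},
    "openai/gpt-4.1-mini": {"P1": True, "P2": False, "P3": False},
}

to_review = flag_for_review(decisions)
print(to_review)  # prints ['P2']
```

Unanimous decisions can still be wrong, so a sample of the agreed papers is worth checking as well; this sketch only prioritizes where to look first.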

Key insights

LLMs in systematic review screening fail in predictable ways that trace back to semantic and lexical issues, which calls for targeted mitigation strategies.

Principles

LLM screening disagreements stem from identifiable lexical and semantic issues.
Different LLMs offer diverse interpretations, improving error detection.
Unambiguous criteria definitions are crucial for consistent LLM application.

Method

The study used a qualitative cross-study design, analyzing disagreements between human experts and LLMs in zero-shot mode across six software engineering SRs. Divergent decisions were inductively coded to identify recurring patterns.

In practice

Run multiple LLMs for diverse interpretations.
Define criteria unambiguously, spelling out term boundaries and confounders.
Evaluate each screening criterion separately using Boolean logic.
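The last recommendation can be sketched as follows: ask the model for one yes/no verdict per criterion, then combine the verdicts in code rather than asking the model for a single overall decision. The combination rule used here (every inclusion criterion met and no exclusion criterion triggered) is a common convention and an assumption, not necessarily the rule used in the study; the class name, function name, and criterion labels are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CriterionVerdicts:
    """Per-criterion yes/no verdicts for one paper (labels are illustrative)."""
    inclusion: dict = field(default_factory=dict)  # criterion label -> met?
    exclusion: dict = field(default_factory=dict)  # criterion label -> triggered?


def screen(verdicts):
    """Return (decision, reasons) using explicit Boolean logic.

    Assumed rule: include only if all inclusion criteria are met
    and no exclusion criterion is triggered.
    """
    unmet = [name for name, met in verdicts.inclusion.items() if not met]
    triggered = [name for name, hit in verdicts.exclusion.items() if hit]
    decision = not unmet and not triggered
    reasons = [f"unmet inclusion: {n}" for n in unmet]
    reasons += [f"triggered exclusion: {n}" for n in triggered]
    return decision, reasons


# Hypothetical verdicts for a single paper.
paper = CriterionVerdicts(
    inclusion={"IC1 empirical study": True, "IC2 software engineering topic": True},
    exclusion={"EC1 not peer reviewed": False, "EC2 secondary study": True},
)

decision, reasons = screen(paper)
print(decision, reasons)  # prints: False ['triggered exclusion: EC2 secondary study']
```

Keeping the combination step in code makes each exclusion traceable to a named criterion, which helps when auditing disagreements of the "Operationalization - Criteria combination" kind.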

Topics

Large Language Models
Systematic Reviews
Title-Abstract Screening
LLM Reliability
Qualitative Analysis
Software Engineering Research

Best for: Research Scientist, AI Scientist, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.