Understanding LLMs in Title-Abstract Screening: From Disagreements to Recommendations
Summary
A study investigated the reliability of large language models (LLMs) for title-abstract screening in systematic reviews (SRs), moving beyond quantitative agreement metrics to understand why failures occur. Researchers analyzed disagreements between LLMs and human experts across six software engineering SRs covering more than 1,000 primary studies. The LLMs ran in zero-shot mode, with Kappa values ranging from 0.52 to 0.77. Qualitative analysis revealed recurring causes of disagreement, including ambiguous boundaries of key terms, keyword overemphasis, and incorrect topic inference. Based on these findings, the study recommends validating an LLM's semantic understanding before deployment, employing multiple LLMs, and concentrating validation effort on borderline screening cases.
Key takeaway
MLOps engineers deploying LLMs for systematic review screening should look beyond aggregate agreement metrics. Rather than relying solely on Kappa values, investigate qualitatively why the LLM and the human reviewers disagree, with particular attention to semantic ambiguity and keyword overemphasis. Validate the model's semantic understanding before deployment, consider running multiple LLMs, and prioritize human review of borderline cases to improve screening reliability and reduce costly errors.
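One way to approach pre-deployment semantic validation is a small probe set: papers whose correct label depends on how a key term is understood, not on whether the term merely appears. The sketch below is only an illustration under stated assumptions. The names `validate_semantics`, `naive_keyword_classifier`, and `PROBES` are hypothetical, the probe papers are invented for the example, and the keyword classifier is a stand-in for a real LLM call that deliberately exhibits keyword overemphasis.

```python
def validate_semantics(classify, probes):
    """Run a screening function over probe papers and group failures by key term.

    classify: callable(title, abstract) -> "include" or "exclude" (assumed interface).
    probes:   list of dicts with keys "term", "title", "abstract", "expected".
    Returns a dict mapping each key term to the titles that were misclassified.
    """
    failures = {}
    for probe in probes:
        predicted = classify(probe["title"], probe["abstract"])
        if predicted != probe["expected"]:
            failures.setdefault(probe["term"], []).append(probe["title"])
    return failures


def naive_keyword_classifier(title, abstract):
    """Hypothetical stand-in for an LLM: includes anything mentioning 'testing'."""
    text = f"{title} {abstract}".lower()
    return "include" if "testing" in text else "exclude"


# Invented probe papers for a hypothetical review on software testing.
PROBES = [
    {"term": "testing",
     "title": "Mutation testing for Java unit test suites",
     "abstract": "We evaluate mutation operators for software test suites.",
     "expected": "include"},
    {"term": "testing",
     "title": "A/B testing of marketing emails",
     "abstract": "We compare subject lines for customer engagement.",
     "expected": "exclude"},
    {"term": "agile",
     "title": "Agile manufacturing in automotive plants",
     "abstract": "A case study of flexible production lines.",
     "expected": "exclude"},
]

if __name__ == "__main__":
    for term, titles in validate_semantics(naive_keyword_classifier, PROBES).items():
        print(f"term '{term}' misjudged on: {titles}")
```

A term that collects failures here is a candidate for a clearer definition in the screening prompt before any real screening run.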
Key insights
LLM failures in systematic review screening stem from identifiable semantic and inference issues that aggregate agreement scores alone do not reveal.
Principles
- LLM-human screening disagreements have identifiable causes.
- Semantic ambiguity and keyword overemphasis hinder LLM accuracy.
- Incorrect topic inference is a common LLM screening failure.
Method
Disagreements between zero-shot LLMs and human experts were qualitatively analyzed across six software engineering systematic reviews, involving over 1,000 papers, to identify failure causes.
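As a rough illustration of the starting point for such an analysis, the sketch below computes an agreement score and extracts the disagreeing papers for manual inspection. It assumes the reported Kappa is Cohen's kappa between one LLM and the human decision, which the summary does not state explicitly. The function names, paper IDs, and labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(human, llm):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(human) != len(llm) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: share of papers where both raters gave the same label.
    observed = sum(h == m for h, m in zip(human, llm)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    human_counts = Counter(human)
    llm_counts = Counter(llm)
    expected = sum(human_counts[c] * llm_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(paper_ids, human, llm):
    """Return (paper_id, human_label, llm_label) for every mismatch."""
    return [(pid, h, m) for pid, h, m in zip(paper_ids, human, llm) if h != m]


if __name__ == "__main__":
    ids = [f"P{i}" for i in range(1, 9)]
    human = ["include", "include", "exclude", "exclude",
             "exclude", "include", "exclude", "exclude"]
    llm = ["include", "exclude", "exclude", "exclude",
           "include", "include", "exclude", "exclude"]
    print(round(cohen_kappa(human, llm), 3))
    for row in disagreements(ids, human, llm):
        print(row)
```

The kappa value summarizes agreement, while the list returned by `disagreements` is what a reviewer would read through to label causes such as term ambiguity or wrong topic inference. In a real pipeline, `cohen_kappa_score` from scikit-learn could replace the hand-written function.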
In practice
- Validate LLM semantic understanding pre-deployment.
- Employ multiple LLMs for screening tasks.
- Prioritize validation on borderline screening cases.
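The second and third recommendations can be combined in a simple routing rule: collect a decision from several LLMs and send a paper to a human whenever the models do not agree strongly enough. The sketch below is one possible reading, not the study's procedure. The function name `screen_with_ensemble`, the `min_agreement` threshold, and the model names are all assumptions made for illustration.

```python
from collections import Counter


def screen_with_ensemble(votes, min_agreement=1.0):
    """Combine per-model screening decisions and flag borderline papers.

    votes:         dict mapping a model name to "include" or "exclude".
    min_agreement: share of models that must agree before the decision is
                   accepted without human review (1.0 means unanimity).
    Returns (decision, needs_human_review).
    """
    if not votes:
        raise ValueError("at least one model vote is required")
    if not 0.5 < min_agreement <= 1.0:
        # A threshold at or below 0.5 would let a tie pass as a decision.
        raise ValueError("min_agreement must be in (0.5, 1.0]")
    counts = Counter(votes.values())
    top_label, top_count = counts.most_common(1)[0]
    share = top_count / len(votes)
    if share >= min_agreement:
        return top_label, False
    return "borderline", True


if __name__ == "__main__":
    unanimous = {"model_a": "include", "model_b": "include", "model_c": "include"}
    split = {"model_a": "include", "model_b": "exclude", "model_c": "include"}
    print(screen_with_ensemble(unanimous))   # accepted automatically
    print(screen_with_ensemble(split))       # routed to a human reviewer
    print(screen_with_ensemble(split, 0.6))  # looser threshold accepts 2 of 3
```

Requiring unanimity by default is a conservative choice: it sends more papers to human reviewers but keeps split decisions out of the automated path, which is where validation effort is meant to concentrate.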
Topics
- Large Language Models
- Systematic Reviews
- Title-Abstract Screening
- LLM Reliability
- Disagreement Analysis
- Zero-shot Learning
Best for: Research Scientist, AI Scientist, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.