Understanding LLMs in Title-Abstract Screening: From Disagreements to Recommendations

2026-06-16 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, quick

Summary

A study investigated the reliability of large language models (LLMs) for title-abstract screening in systematic reviews (SRs), moving beyond quantitative agreement metrics to understand why failures occur. Researchers analyzed disagreements between LLMs and human experts across six software engineering SRs covering more than 1,000 primary studies. The LLMs were run zero-shot and yielded Kappa values between 0.52 and 0.77. Qualitative analysis revealed recurring causes of disagreement, including boundary ambiguity in key terms, keyword overemphasis, and incorrect topic inference. Based on these findings, the study recommends validating LLM semantic understanding before deployment, employing multiple LLMs, and concentrating validation effort on borderline screening cases.

Key takeaway

MLOps engineers deploying LLMs for systematic review screening should look beyond aggregate agreement metrics. Rather than relying solely on Kappa values, investigate the causes of disagreement qualitatively, such as semantic ambiguity and keyword overemphasis. Validate semantic understanding before deployment, consider running multiple LLMs, and prioritize human review of borderline cases to improve screening reliability and reduce costly errors.

Key insights

LLM failures in systematic review screening are rooted in identifiable semantic and inference issues, not merely quantitative disagreement.

Principles

LLM-human screening disagreements have identifiable causes.
Semantic ambiguity and keyword overemphasis hinder LLM accuracy.
Incorrect topic inference is a common LLM screening failure.

Method

Disagreements between zero-shot LLMs and human experts were qualitatively analyzed across six software engineering systematic reviews, involving over 1,000 papers, to identify failure causes.
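
As a rough illustration of the quantitative step that precedes the qualitative analysis, the sketch below computes an agreement score and then lists the papers where the LLM and the human expert disagree. It assumes the Kappa reported is Cohen's kappa (the summary does not say which variant) and that decisions are binary include/exclude labels. The function names and the sample data are hypothetical, not taken from the study.

```python
from collections import Counter


def cohen_kappa(human, llm):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(human) != len(llm) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: share of papers where both raters gave the same label.
    observed = sum(h == m for h, m in zip(human, llm)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    human_counts = Counter(human)
    llm_counts = Counter(llm)
    expected = sum(human_counts[c] * llm_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(paper_ids, human, llm):
    """Return (paper_id, human_label, llm_label) for every mismatch."""
    return [(pid, h, m) for pid, h, m in zip(paper_ids, human, llm) if h != m]


# Hypothetical screening decisions for eight papers.
paper_ids = ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"]
human = ["include", "include", "exclude", "exclude",
         "include", "exclude", "exclude", "exclude"]
llm = ["include", "exclude", "exclude", "exclude",
       "include", "include", "exclude", "exclude"]

print(round(cohen_kappa(human, llm), 3))
for row in disagreements(paper_ids, human, llm):
    print(row)  # candidates for manual, qualitative inspection
```

The kappa value summarizes how often the two raters agree beyond chance. The disagreement list is the part that feeds the kind of cause analysis the study describes.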

In practice

Validate LLM semantic understanding pre-deployment.
Employ multiple LLMs for screening tasks.
Prioritize validation on borderline screening cases.
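
One way to combine the second and third recommendations is to treat split votes among several LLMs as a signal that a paper is borderline and route it to a human reviewer. The summary does not describe how the study combines models or identifies borderline cases, so the voting rule, the `min_agreement` threshold, and all names below are assumptions made for illustration.

```python
from collections import Counter


def triage(votes_by_paper, min_agreement=1.0):
    """Split papers into auto-decided and human-review sets.

    votes_by_paper maps a paper id to the list of decisions returned by
    different LLMs. A paper is auto-decided only when the share of votes
    for the majority label is at least min_agreement (1.0 = unanimous).
    """
    auto_decided = {}
    needs_review = []
    for paper_id, votes in votes_by_paper.items():
        if not votes:
            needs_review.append(paper_id)  # no model output: always review
            continue
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= min_agreement:
            auto_decided[paper_id] = label
        else:
            needs_review.append(paper_id)  # models disagree: borderline
    return auto_decided, needs_review


# Hypothetical decisions from three LLMs.
votes = {
    "P1": ["include", "include", "include"],
    "P2": ["include", "exclude", "include"],
    "P3": ["exclude", "exclude", "exclude"],
}

auto_decided, needs_review = triage(votes)
print(auto_decided)
print(needs_review)
```

Requiring unanimity sends more papers to human review but concentrates that effort on the cases where the models themselves disagree. Lowering `min_agreement` trades review cost against the risk of accepting a wrong majority decision.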

Topics

Large Language Models
Systematic Reviews
Title-Abstract Screening
LLM Reliability
Disagreement Analysis
Zero-shot Learning

Best for: Research Scientist, AI Scientist, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.