The Warrant Gap: Claim-Conditioned Re-scoring for Fact-Checking
Summary
LLM-based fact-checking systems can achieve high verdict accuracy yet still output "Supports" labels where the cited evidence does not adequately warrant the claim. This issue, termed the "Warrant Gap," arises because structured decomposition methods, while useful for inspecting warrants, can strip away the full-claim context needed to evaluate each facet properly. To address it, researchers Arka Ujjal Dey and John Collomosse introduce SIFT, which re-scores extracted evidence spans against the full claim (claim-conditioned re-scoring). SIFT is paired with WSP (Warranted Supports Proportion), an automatic Natural Language Inference (NLI) check designed to verify whether the cited warrant actually entails the claim. Evaluated across the FEVER, SciFact, 5PILS, and DP benchmarks using four open-source backbones, SIFT recovers up to 27.6 points in accuracy on cells where naive decomposition performs poorly. WSP itself calibrates against human gold evidence with an AUC of 0.92 and a precision of 0.98, outperforming direct prompting methods.
Key takeaway
For NLP engineers building fact-checking systems, the "Warrant Gap" points to a reliability problem: LLMs often cite evidence that does not fully support the claim. Consider integrating methods like SIFT and WSP, which re-score evidence against the full claim and automatically check whether the cited warrant entails it. In the reported evaluation, SIFT recovers up to 27.6 accuracy points on cells where naive decomposition performs poorly, so "Supports" verdicts are better grounded in the evidence they cite.
Key insights
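SIFT re-scores extracted evidence spans against the full claim, and WSP uses an NLI check to test whether the cited warrant actually entails the claim. Together they target "Supports" verdicts that the cited evidence does not warrant.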
Principles
- LLM fact-checkers often output "Supports" without adequate warrant in the cited evidence.
- Full-claim context is vital when evaluating evidence.
- NLI checks verify evidence entailment.
Method
SIFT re-scores extracted evidence spans against the full claim. This is paired with WSP, an automatic NLI check verifying if the cited warrant entails the claim.
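The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `rescore_spans`, `warranted_supports_proportion`, `Prediction`, and the `threshold` parameter are hypothetical, and WSP is assumed here to be the fraction of "Supports" verdicts whose cited warrant passes an entailment check. The `toy_entail` scorer is a crude token-overlap stand-in so the example runs without a model; in practice it would be replaced by a real NLI model that returns an entailment probability.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Assumption: an entailment scorer maps (premise, hypothesis) to a score in [0, 1].
EntailFn = Callable[[str, str], float]


@dataclass
class Prediction:
    claim: str    # the full claim being checked
    verdict: str  # e.g. "Supports" or "Refutes"
    warrant: str  # the evidence span cited for the verdict


def rescore_spans(
    claim: str, spans: Sequence[str], entail: EntailFn
) -> List[Tuple[str, float]]:
    """Score every extracted span against the FULL claim, best first."""
    scored = [(span, entail(span, claim)) for span in spans]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def warranted_supports_proportion(
    preds: Sequence[Prediction], entail: EntailFn, threshold: float = 0.5
) -> float:
    """Fraction of "Supports" verdicts whose warrant entails the claim.

    Returns 0.0 when there are no "Supports" verdicts (a choice made
    for this sketch, not something stated in the source).
    """
    supports = [p for p in preds if p.verdict == "Supports"]
    if not supports:
        return 0.0
    warranted = sum(entail(p.warrant, p.claim) >= threshold for p in supports)
    return warranted / len(supports)


def toy_entail(premise: str, hypothesis: str) -> float:
    """Stand-in scorer: share of hypothesis tokens found in the premise.

    This is NOT natural language inference; it only keeps the sketch runnable.
    """
    hyp = set(hypothesis.lower().split())
    prem = set(premise.lower().split())
    return len(hyp & prem) / len(hyp) if hyp else 0.0
```

Calling `rescore_spans(claim, spans, toy_entail)` ranks spans by how well each one supports the whole claim rather than an isolated sub-claim, and `warranted_supports_proportion(preds, toy_entail)` summarizes how many "Supports" verdicts survive the entailment check. Passing the scorer as a parameter keeps the sketch independent of any particular NLI model.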
In practice
- Improve LLM fact-checking reliability.
- Enhance evidence-based AI reasoning.
- Reduce weakly warranted "Supports" verdicts.
Topics
- Fact-Checking
- Large Language Models
- Natural Language Inference
- Evidence Verification
- SIFT
- WSP
Best for: AI Engineer, Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.