CREDENCE: Claim Reduction for Decomposition & Enhanced Credibility -- Semantic Metrics and Convergence Analysis
Summary
Credence is a framework for automated claim decomposition and evaluation, designed to improve fact-checking by addressing limitations of prior systems. It introduces Semantic-F1, a metric based on BGE-large cosine similarity that outperforms Jaccard-F1 by 15 to 32 percentage points and correctly credits paraphrased claims. The framework also provides formal convergence theorems: rule-based repair is proven to be monotone and to terminate in finitely many steps, while LLM self-repair is non-monotone and therefore needs an early-exit guard. Credence was benchmarked on four models (Phi-3-mini at 3.8B, Qwen3-8B, Gemma-3-12b-it, and Gemini Flash via API) and three new datasets (SocialClaimSplit, WikiSplitBench, ClaimDecompBench). Experiments show that rule-based repair reduces the Atomicity Violation Rate by 47–100%, and that a verified Qwen3-8B can match or exceed unverified Gemini Flash on external benchmarks.
Key takeaway
If you are an AI scientist or ML engineer building automated fact-checking systems, your evaluation of claim decomposition should move beyond token-overlap metrics. Adopt Semantic-F1 so that paraphrased claims are scored accurately. In repair loops, rely on rule-based methods for their proven stability, and apply LLM self-repair only conditionally so that it does not degrade quality. This approach yields more reliable and verifiable atomic claim extraction, which is particularly useful for privacy-sensitive deployments that run local models such as Qwen3-8B.
Key insights
Semantic-F1 and formal repair analysis improve claim decomposition for reliable automated fact-checking.
Principles
- Semantic similarity metrics (BGE-large cosine similarity) are better suited to decomposition evaluation than token-overlap metrics such as Jaccard-F1.
- Rule-based repair is provably monotone and finitely terminating.
- LLM self-repair can be non-monotone; use early-exit guards.
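To make the first principle concrete, here is a minimal sketch of how a similarity-thresholded F1 can be computed with either token overlap or embedding cosine similarity. The matching rule (a claim counts as matched if its best similarity clears a threshold), the threshold values, and the names `soft_f1`, `jaccard_sim`, `cosine_sim`, and `toy_embed` are assumptions for illustration, not the paper's definition. `toy_embed` is a stand-in with a hand-written synonym table so the example runs without a model; in practice you would pass a BGE-large encoder instead.

```python
import math
import re
from typing import Callable, List, Sequence

Similarity = Callable[[str, str], float]


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def jaccard_sim(a: str, b: str) -> float:
    """Token-set overlap: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def cosine_sim(embed: Callable[[str], Sequence[float]]) -> Similarity:
    """Build a string similarity from any embedding function."""
    def sim(a: str, b: str) -> float:
        u, v = embed(a), embed(b)
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norm if norm else 0.0
    return sim


def soft_f1(pred: List[str], gold: List[str], sim: Similarity, threshold: float) -> float:
    """F1 where a claim is 'matched' if its best similarity >= threshold (assumed rule)."""
    if not pred or not gold:
        return 0.0
    precision = sum(max(sim(p, g) for g in gold) >= threshold for p in pred) / len(pred)
    recall = sum(max(sim(p, g) for p in pred) >= threshold for g in gold) / len(gold)
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy stand-in for a sentence encoder: counts over a tiny concept vocabulary,
# with synonyms collapsed by hand. A real setup would embed with BGE-large
# (for example via the sentence-transformers library); that choice is an assumption.
_CONCEPTS = ["the", "company", "buy", "startup"]
_SYNONYMS = {"firm": "company", "purchased": "buy", "bought": "buy"}


def toy_embed(text: str) -> List[float]:
    tokens = [_SYNONYMS.get(t, t) for t in tokenize(text)]
    return [float(tokens.count(c)) for c in _CONCEPTS]


pred = ["The firm purchased the startup"]
gold = ["The company bought the startup"]
jaccard_f1 = soft_f1(pred, gold, jaccard_sim, threshold=0.5)
semantic_f1 = soft_f1(pred, gold, cosine_sim(toy_embed), threshold=0.8)
print(f"Jaccard-F1={jaccard_f1:.2f}  Semantic-F1={semantic_f1:.2f}")
```

The paraphrase shares only two of six distinct tokens with the reference, so the token-overlap variant gives it no credit, while the embedding-based variant treats it as a match. The gap in this toy case says nothing about the size of the gap reported in the paper.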
Method
The Credence pipeline uses a prompted LLM decomposer, a rule-based verifier (atomicity, entity, repetition checks), and a two-tier repairer (rule-based, then conditional LLM self-repair).
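The verifier and the first repair tier can be sketched as follows. This is an illustrative approximation under stated assumptions, not the paper's rules: the atomicity check flags any claim containing the word "and", the entity check flags claims that open with a pronoun, and the repetition check flags duplicates after normalization. `Violation`, `verify`, `rule_repair_once`, and `rule_repair` are hypothetical names. The loop accepts a repair only when the violation count strictly drops, which is one simple way to obtain the monotone, finitely terminating behaviour described above.

```python
import re
from dataclasses import dataclass
from typing import List

PRONOUNS = {"he", "she", "it", "they", "this", "that"}  # assumed entity heuristic


@dataclass(frozen=True)
class Violation:
    kind: str   # "atomicity", "entity", or "repetition"
    index: int  # position of the offending claim


def _norm(claim: str) -> str:
    """Lowercase and collapse punctuation so near-identical claims compare equal."""
    return re.sub(r"\W+", " ", claim.lower()).strip()


def verify(claims: List[str]) -> List[Violation]:
    """Run the three rule checks and return every violation found."""
    violations, seen = [], set()
    for i, claim in enumerate(claims):
        words = _norm(claim).split()
        if "and" in words:                      # crude atomicity test
            violations.append(Violation("atomicity", i))
        if words and words[0] in PRONOUNS:      # unresolved reference
            violations.append(Violation("entity", i))
        key = " ".join(words)
        if key in seen:                         # duplicate claim
            violations.append(Violation("repetition", i))
        seen.add(key)
    return violations


def rule_repair_once(claims: List[str]) -> List[str]:
    """Split conjunctions and drop duplicates; entity issues are left untouched."""
    repaired, seen = [], set()
    for claim in claims:
        body = claim.strip().rstrip(".")
        for part in re.split(r"\s+and\s+", body, flags=re.IGNORECASE):
            part = part.strip()
            key = _norm(part)
            if part and key not in seen:
                seen.add(key)
                repaired.append(part)
    return repaired


def rule_repair(claims: List[str], max_rounds: int = 5) -> List[str]:
    """Repeat rule repair while the violation count strictly decreases."""
    current, best = list(claims), len(verify(claims))
    for _ in range(max_rounds):
        if best == 0:
            break
        candidate = rule_repair_once(current)
        count = len(verify(candidate))
        if count >= best:   # no strict improvement: keep what we have and stop
            break
        current, best = candidate, count
    return current


claims = [
    "Alice founded Acme and Bob joined in 2020.",
    "Alice founded Acme.",
    "He left in 2022.",
]
repaired = rule_repair(claims)
print(repaired, [v.kind for v in verify(repaired)])
```

Because the violation count is a non-negative integer that must fall on every accepted round, the loop cannot run forever and never returns something worse than its input. Violations the rules cannot fix, such as the unresolved pronoun here, are what a second tier would be asked to handle.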
In practice
- Adopt BGE-large cosine similarity for decomposition evaluation.
- Gate LLM self-repair conditionally so that it does not degrade outputs that are already high quality.
- Prioritize rule-based repair for atomicity and redundancy.
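One way to combine conditional gating with an early-exit guard is sketched below. The function name `guarded_self_repair`, its parameters, and the specific policy (skip the LLM when no violations remain, stop at the first rewrite that fails to improve, always return the best version seen) are assumptions for illustration. `llm_rewrite` stands for any callable that asks a model to rewrite the claims; here it is replaced by a scripted stub so the example runs without a model.

```python
from typing import Callable, List

Claims = List[str]


def guarded_self_repair(
    claims: Claims,
    count_violations: Callable[[Claims], int],
    llm_rewrite: Callable[[Claims], Claims],
    max_rounds: int = 3,
) -> Claims:
    """Call the LLM only when needed, and never return a worse set of claims."""
    best, best_score = list(claims), count_violations(claims)
    if best_score == 0:
        return best                      # gate: clean output, skip the LLM entirely
    current = best
    for _ in range(max_rounds):
        current = llm_rewrite(current)
        score = count_violations(current)
        if score >= best_score:
            break                        # early exit: the rewrite did not help
        best, best_score = list(current), score
        if best_score == 0:
            break                        # nothing left to repair
    return best


def count_conjunctions(claims: Claims) -> int:
    """Stand-in verifier: number of claims that still contain ' and '."""
    return sum(" and " in c for c in claims)


# Scripted stub in place of a model: the first rewrite helps, the second regresses.
_scripted = iter([["A and B", "C"], ["A and B", "C and D"]])


def fake_llm(claims: Claims) -> Claims:
    return next(_scripted)


result = guarded_self_repair(["A and B", "C and D"], count_conjunctions, fake_llm)
print(result)
```

Keeping the best-so-far result separate from the current rewrite is what protects against non-monotone behaviour: a regression in a later round ends the loop but cannot overwrite an earlier improvement.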
Topics
- Claim Decomposition
- Automated Fact-Checking
- Semantic Metrics
- Large Language Models
- Convergence Analysis
- NLP Evaluation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.