CREDENCE: Claim Reduction for Decomposition & Enhanced Credibility -- Semantic Metrics and Convergence Analysis

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

Credence is a framework for automated claim decomposition and evaluation, designed to improve fact-checking by addressing limitations of prior systems. It introduces Semantic-F1, a metric based on BGE-large cosine similarity that outperforms Jaccard-F1 by 15 to 32 percentage points because it credits paraphrastic claims correctly. The framework also provides formal convergence theorems: rule-based repair is proven to be monotone and finitely terminating, while LLM self-repair is non-monotone and therefore needs an early-exit guard. Credence was benchmarked on four models: Phi-3-mini (3.8B), Qwen3-8B (8B), Gemma-3-12b-it (12B), and the Gemini Flash API. It was also evaluated on three new datasets: SocialClaimSplit, WikiSplitBench, and ClaimDecompBench. Experiments show that rule-based repair reduces the Atomicity Violation Rate by 47–100%, and that a verified Qwen3-8B can match or exceed unverified Gemini Flash on external benchmarks.

Key takeaway

If you are an AI scientist or ML engineer building automated fact-checking systems, evaluate claim decomposition with more than token-overlap metrics. Adopt Semantic-F1 so that paraphrastic claims are credited accurately. In repair loops, rely on rule-based methods for their proven monotonicity and termination, and apply LLM self-repair only conditionally so that it does not degrade quality. The result is more reliable and verifiable atomic claim extraction, which is particularly useful for privacy-sensitive deployments that run local models such as Qwen3-8B.

Key insights

Semantic-F1 and formal repair analysis improve claim decomposition for reliable automated fact-checking.

Principles

Semantic similarity metrics (BGE-large cosine similarity) evaluate decompositions more accurately than token-overlap metrics such as Jaccard-F1.
Rule-based repair is provably monotone and finitely terminating.
LLM self-repair can be non-monotone; use early-exit guards.
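
The difference between the last two principles can be illustrated with a small repair loop. This is a minimal sketch, not the paper's algorithm: `guarded_repair`, `count_conjunctions`, `split_step`, and `merge_step` are hypothetical names, and counting the word "and" is only a stand-in for a real violation measure. The loop keeps a repaired candidate only when it strictly lowers the violation count, so a step that fails to help (as an LLM rewrite might) triggers an early exit and the previous best is returned.

```python
from typing import Callable, List

Claims = List[str]


def guarded_repair(
    claims: Claims,
    repair_step: Callable[[Claims], Claims],
    count_violations: Callable[[Claims], int],
    max_rounds: int = 3,
) -> Claims:
    """Apply repair_step repeatedly, accepting a candidate only if it
    strictly lowers the violation count (hypothetical early-exit guard)."""
    best = claims
    best_score = count_violations(best)
    for _ in range(max_rounds):
        if best_score == 0:
            break  # nothing left to repair
        candidate = repair_step(best)
        score = count_violations(candidate)
        if score >= best_score:
            break  # early exit: the step did not help, keep the previous best
        best, best_score = candidate, score
    return best


# Toy violation measure (assumption): a claim containing " and " is not atomic.
def count_conjunctions(claims: Claims) -> int:
    return sum(" and " in c for c in claims)


# A helpful step: split every claim on " and ".
def split_step(claims: Claims) -> Claims:
    return [part for c in claims for part in c.split(" and ")]


# An unhelpful step: merge everything back into one compound claim.
def merge_step(claims: Claims) -> Claims:
    return [" and ".join(claims)]


start = ["A rose and B fell", "C held"]
repaired = guarded_repair(start, split_step, count_conjunctions)
unchanged = guarded_repair(start, merge_step, count_conjunctions)
```

Because the violation count in this sketch is a non-negative integer that must drop on every accepted step, the loop cannot run forever. That mirrors the monotone, finitely terminating behaviour described for rule-based repair, while the guard is what keeps an unreliable step from making the output worse.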

Method

The Credence pipeline uses a prompted LLM decomposer, a rule-based verifier (atomicity, entity, repetition checks), and a two-tier repairer (rule-based, then conditional LLM self-repair).
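
The stages described above could be wired together roughly as follows. This is a sketch under stated assumptions: the summary does not give the actual verifier rules, so the three checks here (a conjunction test for atomicity, a leading-pronoun test for entities, and exact-duplicate detection for repetition) are simplified placeholders, and `verify`, `rule_repair`, `run_pipeline`, and the stub decomposer are hypothetical names.

```python
import re
from typing import Callable, Dict, List, Optional

PRONOUNS = {"he", "she", "it", "they", "this", "that"}  # hypothetical list


def verify(claims: List[str]) -> Dict[str, List[int]]:
    """Return the indices of claims that fail each placeholder check."""
    report: Dict[str, List[int]] = {"atomicity": [], "entity": [], "repetition": []}
    seen = set()
    for i, claim in enumerate(claims):
        words = claim.lower().rstrip(".").split()
        key = " ".join(words)
        if " and " in f" {key} ":
            report["atomicity"].append(i)  # compound claim
        if words and words[0] in PRONOUNS:
            report["entity"].append(i)  # unresolved subject
        if key in seen:
            report["repetition"].append(i)  # duplicate of an earlier claim
        seen.add(key)
    return report


def rule_repair(claims: List[str]) -> List[str]:
    """Stand-in for the rule-based tier: split conjunctions, drop repeats."""
    out: List[str] = []
    seen = set()
    for claim in claims:
        for part in re.split(r"\s+and\s+", claim.rstrip("."), flags=re.IGNORECASE):
            part = part.strip()
            if part and part.lower() not in seen:
                seen.add(part.lower())
                out.append(part + ".")
    return out


def total(report: Dict[str, List[int]]) -> int:
    return sum(len(v) for v in report.values())


def run_pipeline(
    text: str,
    decompose: Callable[[str], List[str]],
    llm_repair: Optional[Callable[[List[str]], List[str]]] = None,
) -> List[str]:
    claims = decompose(text)  # tier 0: prompted LLM decomposer (stubbed here)
    if total(verify(claims)) > 0:
        claims = rule_repair(claims)  # tier 1: rule-based repair
    remaining = total(verify(claims))
    if llm_repair is not None and remaining > 0:
        candidate = llm_repair(claims)  # tier 2: conditional LLM self-repair
        if total(verify(candidate)) < remaining:
            claims = candidate  # keep it only if it actually helps
    return claims


# Stub decomposer standing in for a prompted LLM (assumption).
def stub_decomposer(text: str) -> List[str]:
    return ["Paris is in France and Berlin is in Germany.", "Paris is in France."]


result = run_pipeline("unused input", stub_decomposer)
```

Splitting on "and" is deliberately naive: it would leave a fragment without a subject for a sentence like "Paris is in France and has a river", which is the kind of case a real rule set or the second repair tier would need to handle.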

In practice

Adopt BGE-large cosine similarity for decomposition evaluation.
Gate LLM self-repair conditionally so that it cannot degrade outputs that are already high quality.
Prioritize rule-based repair for atomicity and redundancy.
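
To make the first recommendation concrete, the sketch below contrasts a token-overlap score with an embedding-based one. The exact definition of Semantic-F1 is not given in this summary, so the thresholded best-match scheme, both threshold values, and the names `matched_f1`, `jaccard_f1`, and `semantic_f1` are assumptions. The two-dimensional vectors are hand-made toy embeddings, not BGE-large outputs.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two claims."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def matched_f1(
    pred: List[str],
    ref: List[str],
    sim: Callable[[str, str], float],
    threshold: float,
) -> float:
    """F1 where a claim counts as matched if its best similarity to any
    claim on the other side reaches the threshold (assumed scheme)."""
    if not pred or not ref:
        return 0.0
    precision = sum(max(sim(p, r) for r in ref) >= threshold for p in pred) / len(pred)
    recall = sum(max(sim(p, r) for p in pred) >= threshold for r in ref) / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def jaccard_f1(pred: List[str], ref: List[str], threshold: float = 0.5) -> float:
    return matched_f1(pred, ref, jaccard, threshold)


def semantic_f1(
    pred: List[str],
    ref: List[str],
    embed: Callable[[str], Vector],
    threshold: float = 0.8,
) -> float:
    vecs = {c: embed(c) for c in set(pred) | set(ref)}  # embed each claim once
    return matched_f1(pred, ref, lambda a, b: cosine(vecs[a], vecs[b]), threshold)


# Toy embeddings (made up for illustration): the paraphrases point the same way.
TOY: Dict[str, Vector] = {
    "The firm cut jobs": [1.0, 0.1],
    "The company laid off staff": [0.9, 0.2],
}

pred = ["The firm cut jobs"]
ref = ["The company laid off staff"]
token_score = jaccard_f1(pred, ref)
embedding_score = semantic_f1(pred, ref, TOY.__getitem__)
```

With these toy inputs the paraphrase shares only one token with the reference, so the token-overlap score gives it no credit, while the embedding-based score treats it as a match. In a real evaluation, `embed` would wrap an actual encoder. One option, assuming the sentence-transformers package is installed, is `SentenceTransformer("BAAI/bge-large-en-v1.5").encode(text, normalize_embeddings=True)`.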

Topics

Claim Decomposition
Automated Fact-Checking
Semantic Metrics
Large Language Models
Convergence Analysis
NLP Evaluation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.