Explicit Evidence Grounding via Structured Inline Citation Generation

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

The FullCite framework generates structured inline citations for large language model (LLM) outputs, with the aim of making generation more factual and faithful in high-stakes domains. Unlike prior methods, FullCite links each generated claim to both its source document and the precise supporting evidence span. The framework offers three strategies: prompt-based generation, constrained decoding using a citation grammar, and posthoc span alignment, which reconstructs citations by finding the most similar snippet in the source (Jaccard similarity, with a 0.7 threshold). Evaluated on the ASQA, BioASQ, and ExpertQA benchmarks with Qwen3-8B and Gemma3-12b-it, FullCite shows that LLMs identify relevant documents well (high Doc-F1) but struggle to localize the exact evidence span (low Snippet-F1). The posthoc strategy substantially improved Snippet-F1, for instance from 12.80 to 61.87 for Qwen3-8B on ASQA. The study also identified two failure modes: primacy bias, where 81.8% of BioASQ citations targeted only the first two of five context documents, and citation omission on binary yes/no questions.
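The gap between Doc-F1 and Snippet-F1 can be illustrated with a small sketch. The exact metric definitions are not given in this summary, so the code below assumes a plain set-based F1 with exact matching; the function name `set_f1` and the example citations are hypothetical and are not taken from the paper.

```python
def set_f1(predicted: set, gold: set) -> float:
    """Set-based F1: harmonic mean of precision and recall over exact matches."""
    if not predicted or not gold:
        return 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical citations, each a (doc_id, snippet) pair.
predicted = [(1, "aspirin reduces fever"), (2, "ibuprofen is an NSAID")]
gold = [(1, "aspirin reduces fever and pain"), (2, "ibuprofen is an NSAID")]

# Document level: only the cited document ids are compared.
doc_f1 = set_f1({doc for doc, _ in predicted}, {doc for doc, _ in gold})

# Snippet level (assumed): the document id and the verbatim span must both match.
snippet_f1 = set_f1(set(predicted), set(gold))

print(doc_f1, snippet_f1)
```

In this toy case both documents are cited correctly, but one snippet is not a verbatim match, so the document-level score is perfect while the snippet-level score drops by half.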

Key takeaway

For NLP engineers and AI scientists building RAG-based QA systems, FullCite's findings show that document-level retrieval is not enough: precise evidence span identification needs explicit attention. Consider adding posthoc span alignment, for example with Jaccard similarity, to improve the accuracy of verbatim evidence citations. Also watch for positional biases such as the primacy bias reported here (related to the "lost-in-the-middle" effect) and for citation omission on yes/no questions, and design your attribution mechanisms to mitigate both so that outputs are more faithful and transparent.

Key insights

FullCite improves LLM attribution by jointly generating document and evidence-span citations, and reveals that LLMs struggle with precise span localization.

Principles

Joint document and span attribution enhances transparency.
Posthoc alignment can significantly boost evidence span identification.
LLMs exhibit primacy bias in document selection.
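One way to check for primacy bias in your own system is to measure what share of citations point at the first few context positions. The sketch below is only an illustration: `position_share` is a hypothetical helper, and the citation positions are made-up data, not the BioASQ results.

```python
from collections import Counter


def position_share(cited_positions, first_k):
    """Fraction of citations that target the first `first_k` context positions (1-based)."""
    counts = Counter(cited_positions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    head = sum(n for position, n in counts.items() if position <= first_k)
    return head / total


# Hypothetical log: the context position (1 to 5) targeted by each citation.
cited = [1, 1, 2, 1, 3, 2, 1, 5, 2, 1]
share = position_share(cited, first_k=2)
print(f"{share:.1%} of citations target the first two documents")
```

With five equally relevant documents, a uniform spread would put about 40% of citations on the first two positions, so a share far above that suggests positional bias rather than relevance.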

Method

FullCite generates structured inline citations that link claims to documents and verbatim evidence snippets using one of three strategies: prompt-based generation, constrained decoding via a finite-state automaton, or posthoc span alignment (Jaccard similarity, with a 0.7 threshold).
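A minimal sketch of posthoc span alignment follows. It assumes that 0.7 acts as an acceptance threshold, that candidate snippets are sentences, and that similarity is computed over lowercase word sets; none of these details are confirmed by this summary, and the names `jaccard` and `align_span` are hypothetical.

```python
import re


def tokens(text):
    """Lowercase word tokens as a set."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def align_span(generated_snippet, document, threshold=0.7):
    """Return (verbatim_sentence, score); the sentence is None if no candidate reaches the threshold."""
    # Assumption: candidate spans are sentences split on terminal punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    best, best_score = None, 0.0
    for sentence in sentences:
        score = jaccard(generated_snippet, sentence)
        if score > best_score:
            best, best_score = sentence, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score


document = (
    "Aspirin was first synthesized in 1897. "
    "It reduces fever and relieves mild pain. "
    "Side effects include stomach upset."
)
# The model's snippet is close to the source but not verbatim.
span, score = align_span("It reduces fever and relieves pain.", document)
print(span, round(score, 3))
```

The aligned output replaces the model's near-verbatim snippet with the exact sentence from the source, which is what a verbatim-match metric rewards. Returning None below the threshold avoids attaching an unrelated span to a claim.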

In practice

Implement posthoc span alignment for better snippet-level attribution.
Counter primacy bias by diversifying document selection and ordering so that citations are not concentrated on the first context documents.
Enforce attribution for binary questions to prevent citation omission.
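The last two recommendations can be sketched as simple pre- and post-processing steps. Everything here is an assumption for illustration: the inline citation format `[doc_id: "snippet"]`, the helper names, and the idea of shuffling document order are not taken from the paper.

```python
import random
import re

# Hypothetical inline citation format: [doc_id: "verbatim snippet"]
CITATION = re.compile(r'\[(\d+):\s*"([^"]+)"\]')


def has_citation(answer):
    """True if the answer contains at least one citation in the assumed format."""
    return CITATION.search(answer) is not None


def shuffle_documents(docs, seed):
    """Shuffle context documents and return a map from shown position to original id (both 1-based)."""
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)
    shuffled = [docs[i] for i in order]
    position_to_original = {pos + 1: idx + 1 for pos, idx in enumerate(order)}
    return shuffled, position_to_original


docs = ["doc A", "doc B", "doc C", "doc D", "doc E"]
shuffled, mapping = shuffle_documents(docs, seed=0)

# A bare yes/no answer fails the check and could be sent back with a
# stricter prompt; a cited answer passes.
print(has_citation("Yes."))
print(has_citation('Yes [2: "the gene is expressed in liver"].'))
```

The position map lets citations produced against the shuffled order be translated back to the original document ids before evaluation.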

Topics

Attributed Question Answering
Large Language Models
Citation Generation
Retrieval-Augmented Generation
Evidence Span Identification
LLM Evaluation

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.