Explicit Evidence Grounding via Structured Inline Citation Generation
Summary
The FullCite framework generates structured inline citations for large language model (LLM) outputs, with the aim of making generation more factual and faithful in high-stakes domains. Unlike prior methods, FullCite links each generated claim to both its source document and the precise supporting evidence span. The framework employs three distinct strategies: prompt-based generation, constrained decoding using a citation grammar, and posthoc span alignment, which reconstructs citations by finding the most similar snippet (Jaccard similarity 0.7). In evaluations on the ASQA, BioASQ, and ExpertQA benchmarks with Qwen3-8B and Gemma3-12b-it, LLMs identify relevant documents effectively (high Doc-F1) but struggle to localize precise evidence spans (low Snippet-F1). The posthoc strategy significantly improved Snippet-F1, for instance from 12.80 to 61.87 for Qwen3-8B on ASQA. The study also identified two failure modes: primacy bias, where 81.8% of BioASQ citations targeted only the first two of five context documents, and citation omission on binary yes/no questions.
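The gap between Doc-F1 and Snippet-F1 can be illustrated with a minimal sketch. It assumes both metrics are set-overlap F1 scores with exact matching, which is a simplification; the paper's exact definitions may differ, and the function name `set_f1` and the sample data are hypothetical.

```python
def set_f1(predicted: set, gold: set) -> float:
    """Set-overlap F1 between predicted and gold items (exact match; an assumed definition)."""
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the model cites the right documents
# but paraphrases the evidence instead of quoting it verbatim.
doc_f1 = set_f1({"doc1", "doc2"}, {"doc1", "doc2"})
snippet_f1 = set_f1({"aspirin lowers fever"}, {"aspirin reduces fever"})
```

Under these assumptions the document-level score is perfect while the snippet-level score is zero, which mirrors the pattern the summary describes: correct sources, imprecise spans.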
Key takeaway
For NLP engineers and AI scientists developing RAG-based QA systems, FullCite's findings highlight the need to prioritize precise evidence span identification over document-level retrieval alone. Consider implementing posthoc span alignment, for example with Jaccard similarity, to improve the accuracy of verbatim evidence citations. Also account for positional biases such as the primacy bias observed on BioASQ, and for citation omission on yes/no questions, and design your attribution mechanisms to mitigate both explicitly so that outputs are more faithful and transparent.
Key insights
FullCite improves LLM attribution by jointly generating document and evidence-span citations, and it reveals that LLMs struggle with precise span localization.
Principles
- Joint document and span attribution enhances transparency.
- Posthoc alignment can significantly boost evidence span identification.
- LLMs exhibit primacy bias in document selection.
Method
FullCite uses prompt-based generation, constrained decoding via a finite-state automaton, or posthoc span alignment (Jaccard similarity 0.7) to generate structured inline citations linking claims to documents and verbatim evidence snippets.
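The posthoc alignment step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the 0.7 figure acts as a minimum Jaccard score, that candidate snippets are sentences, and that similarity is computed over lowercase word tokens. The names `tokenize`, `jaccard`, and `align_snippet` are hypothetical.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase word tokens as a set (assumed tokenization)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def align_snippet(generated: str, document: str, threshold: float = 0.7):
    """Return (best_sentence, score), or (None, score) if no sentence reaches the threshold."""
    # Naive sentence split on terminal punctuation (an assumption for this sketch).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    generated_tokens = tokenize(generated)
    best, best_score = None, 0.0
    for sentence in sentences:
        score = jaccard(generated_tokens, tokenize(sentence))
        if score > best_score:
            best, best_score = sentence, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score


document = (
    "Aspirin reduces fever. It was first synthesized in 1897. "
    "Side effects include stomach upset."
)
matched, score = align_snippet("it was synthesized in 1897", document)
unmatched, low_score = align_snippet("bananas are yellow", document)
```

Replacing the model's near-verbatim snippet with the matched source sentence is what makes the resulting citation verbatim; snippets with no sufficiently similar sentence are left unaligned rather than forced onto a poor match.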
In practice
- Implement posthoc span alignment for better snippet-level attribution.
- Address primacy bias by diversifying document selection strategies.
- Enforce attribution for binary questions to prevent citation omission.
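Two of the checks above can be sketched as simple diagnostics. This assumes citations are recorded as 1-indexed context positions and that answers carry bracketed numeric markers such as `[2]`; both conventions, and the function names, are hypothetical and not taken from the paper.

```python
import re


def early_position_share(cited_positions, top_k=2):
    """Fraction of citations that point at the first top_k context positions (1-indexed)."""
    if not cited_positions:
        return 0.0
    early = sum(1 for position in cited_positions if position <= top_k)
    return early / len(cited_positions)


def has_citation(answer: str) -> bool:
    """True if the answer contains at least one bracketed numeric marker (assumed format)."""
    return re.search(r"\[\d+\]", answer) is not None


# Hypothetical log of cited context positions across several answers.
share = early_position_share([1, 1, 2, 3, 1, 2, 5, 1, 2, 4])

# Flag short yes/no answers that were emitted without any citation.
answers = ["Yes.", "Yes [2].", "No, the trial found no effect [1][3]."]
uncited = [a for a in answers if not has_citation(a)]
```

A high share on the first positions suggests primacy bias worth investigating, for instance by shuffling document order and re-measuring; uncited answers can be rejected or regenerated.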
Topics
- Attributed Question Answering
- Large Language Models
- Citation Generation
- Retrieval-Augmented Generation
- Evidence Span Identification
- LLM Evaluation
Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.