Citation Discipline in Spec-Driven Development: A Cross-Model Empirical Study of Output Determinism and Automated Hallucination Detection in LLM-Generated Code
Summary
An empirical study compared three Spec-Driven Development (SDD) frameworks, traceSDD, Spec Kit, and OpenSpec, for Large Language Model (LLM)-powered code generation. Using Claude Sonnet 4.6 (240 implementations) and GLM-5-turbo (600 implementations), the study measured output determinism and the automated hallucination detection rate (TDR). The findings reveal a consistent trade-off: uncited conditions produced significantly higher determinism than cited conditions (Claude: $d=-0.76$, $p=0.003$; GLM: $d=-0.72$, $p<0.001$). Conversely, only cited conditions enabled automated hallucination detection, achieving TDRs of 86.4% for Claude and 88.0% for GLM, compared with 0% for the alternatives. traceSDD (cited) also significantly outperformed Spec Kit on determinism (Claude: $d=0.47$, $p=0.049$; GLM: $d=0.42$, $p=0.003$).
Key takeaway
For AI engineers designing or implementing LLM-powered code generation systems: if your priority is robust code verifiability and automated hallucination detection, adopt spec-driven development frameworks that enforce mandatory per-line requirement citations. This approach may reduce output determinism, but in this study it was the only one that enabled automated detection at all (86.4% to 88.0% TDR, versus 0% without citations). Consider frameworks like traceSDD to balance determinism with verifiability.
Key insights
Citation annotations in LLM-generated code trade output determinism for critical verifiability and automated hallucination detection.
Principles
- Mandatory per-line citations enable automated hallucination detection.
- The determinism-verifiability trade-off generalizes across LLM architectures.
- SDD frameworks differ in traceability enforcement and impact outcomes.
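To make the first principle concrete, here is a minimal sketch of how per-line citations could let a script flag suspect lines automatically. The citation syntax (`# [REQ-n]`), the function name `audit_citations`, and the sample requirement IDs are all assumptions made for illustration; the summary does not describe the annotation format or detection pipeline the study actually used.

```python
import re

# Hypothetical citation format: a trailing comment such as "# [REQ-3]".
# The real annotation syntax used by traceSDD is not given in this summary.
CITATION = re.compile(r"#\s*\[(REQ-\d+)\]\s*$")


def audit_citations(code: str, known_requirements: set[str]) -> dict:
    """Flag code lines whose citation is missing or names an unknown requirement."""
    missing, unknown, checked = [], [], 0
    for lineno, line in enumerate(code.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comment-only lines
        checked += 1
        match = CITATION.search(line)
        if match is None:
            missing.append(lineno)  # no citation at all
        elif match.group(1) not in known_requirements:
            unknown.append((lineno, match.group(1)))  # cites a requirement not in the spec
    return {"checked": checked, "missing": missing, "unknown": unknown}


# Illustrative generated code: line 3 cites a requirement that does not exist,
# and line 4 carries no citation.
generated = """\
def total(prices):  # [REQ-1]
    subtotal = sum(prices)  # [REQ-1]
    return subtotal * 1.2  # [REQ-9]
    print("debug")
"""

report = audit_citations(generated, known_requirements={"REQ-1", "REQ-2"})
print(report)
```

Without citations there is nothing for such a check to compare against the spec, which is one plausible reading of why the uncited conditions scored 0% on automated detection.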
Method
Controlled empirical studies compared traceSDD, Spec Kit, and OpenSpec using Claude Sonnet 4.6 and GLM-5-turbo to measure output determinism and automated hallucination detection rate (TDR).
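The summary does not say how output determinism was scored. One common way to quantify it is the mean pairwise similarity of repeated generations from the same prompt. The sketch below uses that approach with Python's standard `difflib`; the metric choice and the name `determinism_score` are assumptions, not the study's method.

```python
from difflib import SequenceMatcher
from itertools import combinations


def determinism_score(outputs: list[str]) -> float:
    """Mean pairwise similarity in [0, 1]; 1.0 means every run produced identical text."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs to compare")
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


# Illustrative outputs only, not data from the study.
identical_runs = ["return a + b"] * 3
varied_runs = ["return a + b", "return b + a", "return sum([a, b])"]

print(determinism_score(identical_runs))
print(determinism_score(varied_runs))
```

Under a text-based metric like this one, extra citation comments add more surface for run-to-run variation, which would be consistent with the lower determinism reported for cited conditions.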
In practice
- Cited conditions achieved 86.4%-88.0% automated hallucination detection.
- Uncited conditions consistently yielded higher output determinism.
- traceSDD significantly improved determinism over Spec Kit.
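The effect sizes quoted in the summary are Cohen's $d$ values. As a reading aid, the sketch below computes $d$ with a pooled standard deviation and shows how its sign depends on which group is listed first. The scores are made up for illustration, and the "cited minus uncited" ordering is an assumption inferred from the negative values reported alongside higher uncited determinism.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Cohen's d with pooled sample standard deviation: (mean_a - mean_b) / s_pooled."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Made-up determinism scores, for illustration only (not data from the study).
cited = [0.70, 0.74, 0.78, 0.82]
uncited = [0.80, 0.84, 0.88, 0.92]

d = cohens_d(cited, uncited)
print(round(d, 3))
```

Here $d$ comes out negative because the first group (cited) has the lower mean; swapping the argument order flips the sign without changing the magnitude.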
Topics
- Spec-Driven Development
- LLM Code Generation
- Output Determinism
- Hallucination Detection
- Code Traceability
- Empirical Study
Best for: Research Scientist, AI Architect, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.