Citation Discipline in Spec-Driven Development: A Cross-Model Empirical Study of Output Determinism and Automated Hallucination Detection in LLM-Generated Code
Summary
Spec-Driven Development (SDD) frameworks guide Large Language Model (LLM)-powered code generation but differ in how strictly they enforce traceability. This cross-model empirical study compares three SDD frameworks: traceSDD (mandatory per-line citations), Spec Kit (artifact-level traceability), and OpenSpec (post-hoc external trace maps). Using Claude Sonnet 4.6 (N=20, 240 implementations) and GLM-5-turbo (N=50, 600 implementations), the study measured output determinism (lexical similarity) and the automated hallucination detection rate (TDR). The findings reveal a consistent trade-off: uncited conditions produce significantly higher determinism (Claude: d=-0.76, p=0.003; GLM: d=-0.72, p<0.001), while only the cited condition enables automated hallucination detection (TDR: Claude 86.4%, GLM 88.0%, versus 0% for the alternatives, with a false positive rate (FPR) of 0%). traceSDD (cited) significantly outperforms Spec Kit on determinism (Claude: d=0.47, p=0.049; GLM: d=0.42, p=0.003) but not OpenSpec. The determinism penalty is most pronounced for easy and small tasks.
Key takeaway
AI Architects and Machine Learning Engineers building production systems should adopt a specification-driven development framework with mandatory inline citations, such as traceSDD. This enables 86-88% automated hallucination detection with 0% false positives, a critical capability for regulated or high-stakes domains. The approach carries a moderate determinism cost, but the verifiability benefits outweigh it. For maximum coverage, consider supplementing the orphan-REQ check with an uncited-line check.
Key insights
Mandatory inline citations in LLM-generated code trade output determinism for automated hallucination detection.
Principles
- Citation annotations introduce lexical variability.
- Structured specifications anchor LLM output determinism.
- Hallucination detection requires in-code annotations.
Method
The study compared traceSDD, Spec Kit, and OpenSpec using Claude Sonnet 4.6 and GLM-5-turbo. It measured lexical similarity (LSS) for determinism and traceability detection rate (TDR) for hallucination detection.
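To make the determinism measurement concrete, the sketch below scores a set of repeated generations by their mean pairwise lexical similarity. This summary does not specify the study's exact LSS formula, so the use of Python's `difflib.SequenceMatcher` ratio is an assumption chosen for illustration, and the function name `mean_pairwise_similarity` is hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(outputs: list[str]) -> float:
    """Average similarity ratio over every unordered pair of outputs.

    Assumption: difflib's ratio stands in for the study's lexical
    similarity metric, which this summary does not define.
    Returns a value in [0, 1]; 1.0 means all outputs are identical.
    """
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs to compare")
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    # Hypothetical repeated generations for the same specification.
    runs = [
        "def add(a, b):\n    return a + b\n",
        "def add(a, b):\n    return a + b\n",
        "def add(x, y):\n    return x + y\n",
    ]
    print(f"mean pairwise similarity: {mean_pairwise_similarity(runs):.3f}")
```

A higher score under this sketch indicates more deterministic output across runs. Comparing the score between cited and uncited conditions would mirror the comparison described above, though not necessarily with the study's own metric.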
In practice
- Use traceSDD with citations for high-stakes code verification.
- Use the uncited REQ format for rapid prototyping.
- Supplement the orphan-REQ check with an uncited-line check.
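The two checks named above can be sketched as follows. This is a minimal illustration under stated assumptions, not traceSDD's actual implementation: it assumes citations appear as `REQ-<number>` tokens in trailing comments, treats an "orphan" as a cited ID that does not exist in the specification, and treats an "uncited line" as a non-blank, non-comment line carrying no citation. The pattern and function names are hypothetical.

```python
import re

# Assumption: citations look like "REQ-001"; the real format may differ.
REQ_PATTERN = re.compile(r"REQ-\d+")


def orphan_reqs(code: str, spec_ids: set[str]) -> set[str]:
    """Return cited requirement IDs that are absent from the specification."""
    cited = set(REQ_PATTERN.findall(code))
    return cited - spec_ids


def uncited_lines(code: str) -> list[int]:
    """Return 1-based numbers of code lines that carry no citation.

    Blank lines and comment-only lines are skipped.
    """
    flagged = []
    for number, line in enumerate(code.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if not REQ_PATTERN.search(line):
            flagged.append(number)
    return flagged


if __name__ == "__main__":
    spec = {"REQ-001", "REQ-002"}
    generated = (
        "total = price * qty  # REQ-001\n"
        "total *= 0.9  # REQ-999\n"
        "log(total)\n"
    )
    print("orphan citations:", orphan_reqs(generated, spec))
    print("uncited lines:", uncited_lines(generated))
```

The orphan check flags citations to requirements that do not exist, while the uncited-line check flags code with no claimed requirement at all, which is why the two are complementary.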
Topics
- LLM Code Generation
- Spec-Driven Development
- Requirements Traceability
- Hallucination Detection
- Output Determinism
- Claude Sonnet 4.6
- GLM-5-turbo
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.