Citation Discipline in Spec-Driven Development: A Cross-Model Empirical Study of Output Determinism and Automated Hallucination Detection in LLM-Generated Code

· Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Research Methodology & Innovation · Depth: Expert, extended

Summary

Spec-Driven Development (SDD) frameworks guide code generation by Large Language Models (LLMs), but they differ in how strictly they enforce traceability. A cross-model empirical study compared three SDD frameworks: traceSDD (mandatory per-line citations), Spec Kit (artifact-level traceability), and OpenSpec (post-hoc external trace maps). Using Claude Sonnet 4.6 (N=20, 240 implementations) and GLM-5-turbo (N=50, 600 implementations), the study measured output determinism as lexical similarity (LSS) and automated hallucination detection as the traceability detection rate (TDR). The findings show a consistent trade-off. Uncited conditions produce significantly higher determinism than the cited condition (Claude: d=-0.76, p=0.003; GLM: d=-0.72, p<0.001), while only the cited condition enables automated hallucination detection (TDR: Claude 86.4%, GLM 88.0%, versus 0% for the alternatives), with a false positive rate (FPR) of 0%. Cited traceSDD is still significantly more deterministic than Spec Kit (Claude: d=0.47, p=0.049; GLM: d=0.42, p=0.003), but not significantly more deterministic than OpenSpec. The determinism penalty is most pronounced for easy and small tasks.
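To make the reported rates concrete, the following is a minimal sketch of how a detection rate and a false positive rate could be computed from per-item labels. It assumes the conventional definitions (detected hallucinations over all hallucinations, and wrongly flagged clean items over all clean items); the summary does not state the study's exact formulas, and the function name `detection_rates` is hypothetical.

```python
def detection_rates(flagged: list[bool], hallucinated: list[bool]) -> tuple[float, float]:
    """Return (detection rate, false positive rate) from per-item labels.

    Assumed definitions, not taken from the study:
      detection rate      = flagged hallucinations / all hallucinations
      false positive rate = flagged clean items / all clean items
    """
    if len(flagged) != len(hallucinated):
        raise ValueError("label lists must have the same length")

    true_pos = sum(1 for f, h in zip(flagged, hallucinated) if f and h)
    false_pos = sum(1 for f, h in zip(flagged, hallucinated) if f and not h)
    positives = sum(1 for h in hallucinated if h)
    negatives = len(hallucinated) - positives

    # Guard against empty classes instead of dividing by zero.
    tdr = true_pos / positives if positives else 0.0
    fpr = false_pos / negatives if negatives else 0.0
    return tdr, fpr
```

Under these definitions, a checker that never flags anything scores 0% on both rates, which matches the 0% TDR reported for the frameworks without inline citations.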

Key takeaway

AI Architects and Machine Learning Engineers building production systems should consider adopting a specification-driven development framework with mandatory inline citations, such as traceSDD. In this study, that approach enabled 86-88% automated hallucination detection with 0% false positives, a critical capability for regulated or high-stakes domains. It carries a moderate determinism cost, which the verifiability benefits are likely to outweigh in those domains. Consider supplementing the orphan-REQ check with an uncited-line check for maximum coverage.
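The two checks mentioned in the takeaway can be sketched as follows. This is an illustration under stated assumptions, not the traceSDD implementation: it assumes citations appear as trailing comments of the form `# REQ-001`, that an orphan-REQ check flags citations to requirement IDs missing from the spec, and that an uncited-line check flags code lines carrying no citation at all. The function name `check_citations` and the citation format are hypothetical.

```python
import re

# Hypothetical citation format: a comment such as "# REQ-012" on the code line.
CITATION = re.compile(r"#\s*(REQ-\d+)")


def check_citations(code: str, spec_ids: set[str]) -> dict[str, list]:
    """Flag orphan citations and uncited code lines in generated code."""
    orphans = []  # (line number, requirement ID) for IDs absent from the spec
    uncited = []  # line numbers of code lines without any citation

    for line_no, line in enumerate(code.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comment-only lines

        cited_ids = CITATION.findall(line)
        if not cited_ids:
            uncited.append(line_no)
        orphans.extend((line_no, rid) for rid in cited_ids if rid not in spec_ids)

    return {"orphans": orphans, "uncited": uncited}


generated = "x = 1  # REQ-001\ny = 2  # REQ-999\nz = 3\n"
report = check_citations(generated, spec_ids={"REQ-001"})
```

Here line 2 cites an ID that is not in the spec and line 3 has no citation, so each check catches a case the other would miss.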

Key insights

Mandatory inline code citations in LLM-generated code trade output determinism for automated hallucination detection.

Method

The study compared three SDD frameworks (traceSDD, Spec Kit, and OpenSpec) on two models, Claude Sonnet 4.6 and GLM-5-turbo. It measured lexical similarity (LSS) as the determinism metric and traceability detection rate (TDR) as the hallucination detection metric.
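As a rough illustration of a lexical-similarity determinism score, the sketch below averages a pairwise similarity ratio over repeated generations for the same task. It uses Python's `difflib.SequenceMatcher` as a stand-in; the summary does not specify how the study computes LSS, so both the metric and the function name `pairwise_similarity` are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def pairwise_similarity(outputs: list[str]) -> float:
    """Mean pairwise similarity across repeated generations.

    Returns 1.0 when every output is identical and 0.0 when no pair
    shares any matching characters. Stand-in metric, not the study's LSS.
    """
    if len(outputs) < 2:
        raise ValueError("need at least two outputs to compare")
    return mean(
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(outputs, 2)
    )
```

Calling `pairwise_similarity` on the N repeated implementations of one task yields a single score per task and condition, which could then be compared across frameworks.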

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.