Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation
Summary
A systematic evaluation assessed 30 Large Language Models across eight families for their ability to generate correct TLA+ specifications from natural language. The study utilized a curated dataset of 205 TLA+ specifications, validating outputs with the SANY parser for syntactic correctness and the TLC model checker for semantic correctness. Results indicate LLMs achieved up to 26.6% syntactic correctness but only 8.6% semantic correctness, with successes exclusively observed under progressive prompting. Notably, model size did not correlate with quality; for instance, DeepSeek r1:8b outperformed its 70B variant. Code-specialized models consistently underperformed due to negative transfer. Five recurring hallucination categories were identified, suggesting current LLMs require expert oversight for reliable TLA+ specification generation.
Key takeaway
For Machine Learning Engineers or AI Architects exploring LLMs for TLA+ specification synthesis, recognize that current models are not yet reliable for semantic correctness. You should prioritize progressive prompting strategies, as they were the only ones to achieve any semantic passes (8.6%). Furthermore, consider smaller, reasoning-oriented models like DeepSeek r1:8b over larger or code-specialized variants, and plan for robust post-processing and iterative verification with SANY and TLC to bridge the significant syntax-semantics gap.
Key insights
LLMs reach at most 8.6% semantic correctness on TLA+ specifications (versus up to 26.6% syntactic correctness), and only under progressive prompting.
Principles
- Model size does not predict TLA+ quality.
- Reasoning alignment is key for formal languages.
- Code-specialized models show negative transfer.
Method
Systematically evaluate LLM-generated TLA+ specifications using SANY for syntax and TLC for semantics, across four prompting strategies (Few-Shot, Progressive, Fill-in-Middle, Half Completion).
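The two-stage check described above can be sketched as a small harness: run SANY first, and run TLC only if parsing succeeds. This is a minimal sketch, not the study's actual harness. It assumes a local `tla2tools.jar` (the path is a placeholder), a hand-written TLC `.cfg` file per specification, and that a non-zero exit code signals failure; some tool versions may require parsing the textual output instead. The `classify` function and its labels are hypothetical names introduced for illustration.

```python
import subprocess
from typing import Callable, List, Sequence

# Assumption: path to a locally downloaded tla2tools.jar.
TLA_TOOLS_JAR = "tla2tools.jar"

# A runner takes a command line and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def sany_command(spec_path: str, jar: str = TLA_TOOLS_JAR) -> List[str]:
    """Command line that parses a spec with SANY (syntax stage)."""
    return ["java", "-cp", jar, "tla2sany.SANY", spec_path]


def tlc_command(spec_path: str, config_path: str, jar: str = TLA_TOOLS_JAR) -> List[str]:
    """Command line that model-checks a spec with TLC (semantic stage)."""
    return ["java", "-cp", jar, "tlc2.TLC", "-config", config_path, spec_path]


def run_exit_code(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd), capture_output=True, text=True).returncode


def classify(spec_path: str, config_path: str, runner: Runner = run_exit_code) -> str:
    """Label a generated spec as 'syntax_error', 'semantic_error', or 'pass'.

    Assumption: a non-zero exit code means the stage failed.
    """
    if runner(sany_command(spec_path)) != 0:
        return "syntax_error"  # TLC is skipped when parsing fails
    if runner(tlc_command(spec_path, config_path)) != 0:
        return "semantic_error"
    return "pass"
```

Passing the runner as a parameter keeps the classification logic testable without a Java installation, and makes it easy to swap in a runner that inspects tool output rather than exit codes.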
In practice
- Use progressive prompting for TLA+ generation.
- Prefer smaller, reasoning-oriented LLMs.
- Apply deterministic post-processing for syntax.
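As one illustration of deterministic post-processing, the sketch below cleans raw model output before it reaches SANY: it extracts the first markdown-fenced block if one exists, then rewrites the module header and footer so the module name matches the target file name, which SANY requires. The summary does not say which post-processing steps the study has in mind, so the specific rules here and the `normalize_spec` name are assumptions chosen for illustration.

```python
import re

# Built from pieces so this example contains no literal markdown fence.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"[a-zA-Z+]*\n(.*?)" + FENCE, re.DOTALL)

# TLA+ module header: four or more dashes, MODULE <name>, four or more dashes.
HEADER_RE = re.compile(r"^-{4,}\s*MODULE\s+\w+\s*-{4,}$")
# TLA+ module footer: a line of four or more '=' characters.
FOOTER_RE = re.compile(r"^={4,}$")


def normalize_spec(raw: str, module_name: str) -> str:
    """Return a spec with chat wrapping removed and a consistent header/footer.

    Assumption: the output holds a single, non-nested module.
    """
    match = FENCE_RE.search(raw)
    text = match.group(1) if match else raw
    lines = [line.rstrip() for line in text.strip().splitlines()]
    # Drop any header or footer the model emitted; they are re-added below.
    body = [
        line for line in lines
        if not HEADER_RE.match(line) and not FOOTER_RE.match(line)
    ]
    header = "---- MODULE " + module_name + " ----"
    return "\n".join([header] + body + ["===="]) + "\n"
```

Fixes of this kind target only mechanical syntax problems; they cannot repair a specification whose logic fails under TLC.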
Topics
- TLA+
- Formal Methods
- LLM Evaluation
- Specification Synthesis
- Prompt Engineering
- Model Checking
- LLM Hallucinations
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.