Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation
Summary
A systematic evaluation assessed how well 30 large language models (LLMs) from eight families generate correct TLA+ specifications from natural-language descriptions. Described as the first study of its kind, it used a curated dataset of 205 TLA+ specifications and validated 2,600 open-weight model runs and 130 proprietary model runs with the SANY parser and the TLC model checker. The models reached at most 26.6% syntactic correctness and only 8.6% semantic correctness, and successful generations occurred only under progressive prompting strategies. Model size did not correlate with quality: DeepSeek r1:8b, for instance, outperformed its 70B variant. Code-specialized models consistently underperformed, which the study attributes to negative transfer. The research also identified five recurring hallucination categories linked to training data biases and concluded that current LLMs require expert oversight for reliable TLA+ specification generation. The evaluation framework, code, and dataset are publicly released.
Key takeaway
Software engineers and AI scientists considering large language models for TLA+ specification generation should treat current LLMs as unreliable for producing semantically correct output: the models achieved only 8.6% semantic correctness. Expect significant expert oversight and manual correction. Prefer progressive prompting strategies, be wary of code-specialized models, which underperformed, and do not rely on LLMs alone for critical formal verification tasks.
Key insights
LLMs struggle significantly with semantic correctness in TLA+ specification generation, so their output requires expert oversight.
Principles
- Model size does not predict formal language generation quality.
- Reasoning alignment is crucial for formal language tasks.
- Code-specialized models can suffer negative transfer for formal languages.
Method
The study systematically evaluated 30 LLMs on 205 TLA+ specifications across several prompting strategies, validating each generated specification with the SANY parser (syntactic correctness) and the TLC model checker (semantic correctness).
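The two-stage validation described above can be sketched as a small Python harness. This is an illustrative sketch, not the study's released framework: the function names, the outcome labels, and the jar path are hypothetical, and it assumes the standard `tla2tools.jar` distribution, whose `tla2sany.SANY` and `tlc2.TLC` entry points are invoked through `java -cp`. It also assumes that both tools signal failure with a nonzero exit status; with some tool versions you may need to inspect the printed output instead.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence

# Assumption: the TLA+ tools jar sits in the working directory.
TLA_TOOLS_JAR = "tla2tools.jar"

# A runner takes a command line and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def sany_command(spec: Path, jar: str = TLA_TOOLS_JAR) -> list[str]:
    """Command line that parses a specification with SANY."""
    return ["java", "-cp", jar, "tla2sany.SANY", str(spec)]


def tlc_command(spec: Path, config: Path, jar: str = TLA_TOOLS_JAR) -> list[str]:
    """Command line that model-checks a specification with TLC."""
    return ["java", "-cp", jar, "tlc2.TLC", "-config", str(config), str(spec)]


def run_exit_code(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd), capture_output=True, text=True).returncode


def classify(spec: Path, config: Path, run: Runner = run_exit_code) -> str:
    """Label one generated spec (hypothetical labels, not the study's).

    TLC is only attempted when SANY accepts the spec, mirroring the
    syntactic-then-semantic order of the validation.
    """
    if run(sany_command(spec)) != 0:
        return "syntax_error"
    if run(tlc_command(spec, config)) != 0:
        return "tlc_failed"
    return "passed"
```

Calling `classify(Path("Spec.tla"), Path("Spec.cfg"))` would label a single generated specification. The `run` parameter is injectable so the classification logic can be exercised without a Java installation.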
In practice
- Use progressive prompting for TLA+ generation attempts.
- Prioritize reasoning alignment over raw model size.
- Avoid code-specialized LLMs for TLA+ tasks.
Topics
- TLA+
- Formal Verification
- Large Language Models
- Specification Synthesis
- Semantic Correctness
- Prompting Strategies
Best for: AI Scientist, Research Scientist, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.