Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

2026-06-05 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

A systematic evaluation assessed 30 Large Language Models across eight families for their ability to generate correct TLA+ specifications from natural language. The study used a curated dataset of 205 TLA+ specifications and validated outputs with the SANY parser for syntactic correctness and the TLC model checker for semantic correctness. LLMs reached at most 26.6% syntactic correctness and only 8.6% semantic correctness, and the semantic successes occurred exclusively under progressive prompting. Model size did not correlate with quality: DeepSeek r1:8b, for instance, outperformed the 70B variant. Code-specialized models consistently underperformed, which the study attributes to negative transfer. Five recurring hallucination categories were identified, suggesting that current LLMs need expert oversight to generate TLA+ specifications reliably.

Key takeaway

Machine Learning Engineers and AI Architects exploring LLMs for TLA+ specification synthesis should treat current models as unreliable for semantic correctness. Prioritize progressive prompting, the only strategy that achieved any semantic passes (8.6%). Consider smaller, reasoning-oriented models such as DeepSeek r1:8b over larger or code-specialized variants, and plan for post-processing and iterative verification with SANY and TLC to narrow the wide gap between syntactic and semantic correctness.

Key insights

LLMs achieve at most 8.6% semantic correctness for TLA+ specifications, and only under progressive prompting.

Principles

Model size does not predict TLA+ quality.
Reasoning alignment is key for formal languages.
Code-specialized models show negative transfer.

Method

Systematically evaluate LLM-generated TLA+ specifications using SANY for syntax and TLC for semantics, across four prompting strategies (Few-Shot, Progressive, Fill-in-Middle, Half Completion).
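
The two-stage check described above (parse with SANY, then model-check with TLC) can be sketched as a small harness. This is a minimal illustration and not the study's actual tooling: the names `check_spec`, `Verdict`, and `Runner` are hypothetical, the location of `tla2tools.jar` is an assumption, and the sketch assumes each tool signals failure through a non-zero exit code. Depending on the tools version, it may be necessary to inspect the tool output instead.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Sequence

# A runner takes a command line and returns the process exit code.
# Making it injectable lets the pipeline be exercised without Java installed.
Runner = Callable[[Sequence[str]], int]


def run_subprocess(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(cmd, capture_output=True, text=True).returncode


@dataclass
class Verdict:
    syntax_ok: bool     # SANY accepted the module
    semantics_ok: bool  # TLC finished without reporting an error


def check_spec(spec: Path, jar: Path, run: Runner = run_subprocess) -> Verdict:
    """Hypothetical helper: SANY first, TLC only if parsing succeeded."""
    # Stage 1: syntactic check with the SANY parser.
    sany_cmd = ["java", "-cp", str(jar), "tla2sany.SANY", str(spec)]
    if run(sany_cmd) != 0:  # assumption: non-zero exit code means parse failure
        return Verdict(syntax_ok=False, semantics_ok=False)

    # Stage 2: semantic check with the TLC model checker.
    # Assumes a matching .cfg file sits next to the .tla file.
    cfg = spec.with_suffix(".cfg")
    tlc_cmd = ["java", "-cp", str(jar), "tlc2.TLC", "-config", str(cfg), str(spec)]
    return Verdict(syntax_ok=True, semantics_ok=(run(tlc_cmd) == 0))
```

Running TLC only after SANY succeeds mirrors the syntax-then-semantics split in the reported numbers: a specification cannot count as semantically correct unless it parses.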

In practice

Use progressive prompting for TLA+ generation.
Prefer smaller, reasoning-oriented LLMs.
Apply deterministic post-processing to repair syntax before validation.
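
As one illustration of what deterministic post-processing might look like, the sketch below normalizes raw model output before it reaches SANY. The specific repairs are assumptions chosen for illustration (unwrapping a Markdown code fence, trimming chatter around the module, and adding a missing module header or terminator). The summary does not say which repairs the study applied, and `normalize_spec` is a hypothetical name.

```python
import re

TICKS = "`" * 3  # Markdown fence marker, built indirectly to keep this block readable

# Body of the first Markdown-fenced block, if the model wrapped its answer in one.
FENCE = re.compile(TICKS + r"[a-zA-Z+]*\n(.*?)" + TICKS, re.DOTALL)
# TLA+ module header, e.g. "---- MODULE Counter ----" (at least four dashes per side).
HEADER = re.compile(r"^-{4,}[ \t]*MODULE[ \t]+(\w+)[ \t]*-{4,}[ \t]*$", re.MULTILINE)
# TLA+ module terminator: a line of at least four "=" characters.
FOOTER = re.compile(r"^={4,}[ \t]*$", re.MULTILINE)


def normalize_spec(raw: str, module_name: str) -> str:
    """Hypothetical cleanup pass applied to LLM output before parsing."""
    fenced = FENCE.search(raw)
    text = (fenced.group(1) if fenced else raw).strip()

    header = HEADER.search(text)
    if header:
        text = text[header.start():]  # drop chatter before the module header
    else:
        text = f"---- MODULE {module_name} ----\n{text}"  # add a missing header

    footers = list(FOOTER.finditer(text))
    if footers:
        text = text[:footers[-1].end()]  # drop anything after the final terminator
    else:
        text = f"{text}\n===="  # add a missing terminator

    return text + "\n"
```

For example, `normalize_spec("VARIABLE x", "Counter")` wraps the bare body in a `Counter` module header and terminator. An existing header is kept as written, so a module name that does not match the file name would still need separate handling.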

Topics

TLA+
Formal Methods
LLM Evaluation
Specification Synthesis
Prompt Engineering
Model Checking
LLM Hallucinations

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.