Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

A systematic evaluation assessed 30 large language models (LLMs) across eight families on their ability to generate correct TLA+ specifications from natural-language descriptions. The study used a curated dataset of 205 TLA+ specifications and validated outputs with the SANY parser for syntactic correctness and the TLC model checker for semantic correctness. LLMs achieved up to 26.6% syntactic correctness but only 8.6% semantic correctness, and semantic successes occurred exclusively under progressive prompting. Model size did not correlate with quality: for instance, DeepSeek r1:8b outperformed its 70B variant. Code-specialized models consistently underperformed, a result attributed to negative transfer. Five recurring hallucination categories were identified, suggesting that current LLMs require expert oversight for reliable TLA+ specification generation.

Key takeaway

Machine learning engineers and AI architects exploring LLMs for TLA+ specification synthesis should recognize that current models are not yet reliable for semantic correctness. Prioritize progressive prompting, the only strategy that achieved any semantic passes (8.6%). Consider smaller, reasoning-oriented models such as DeepSeek r1:8b over larger or code-specialized variants, and plan for post-processing and iterative verification with SANY and TLC to bridge the gap between syntactic and semantic correctness.
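The iterative verification mentioned above could be organized as a generate, check, and retry loop. The following is a minimal sketch, not the study's method: `generate` and `check` are hypothetical callables (in practice an LLM call and a wrapper around SANY and TLC), and the retry prompt format is an assumption made for illustration.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures, not taken from the paper:
#   generate(prompt) -> candidate TLA+ text (an LLM call in practice)
#   check(spec)      -> (passed, feedback), e.g. a wrapper around SANY then TLC
Generate = Callable[[str], str]
Check = Callable[[str], Tuple[bool, str]]


def refine(description: str, generate: Generate, check: Check,
           max_rounds: int = 3) -> Optional[str]:
    """Return the first candidate that passes `check`, or None if none does."""
    prompt = description
    for _ in range(max_rounds):
        spec = generate(prompt)
        passed, feedback = check(spec)
        if passed:
            return spec
        # Feed the checker output back so the next attempt can address it.
        prompt = (
            f"{description}\n\n"
            f"Previous attempt:\n{spec}\n\n"
            f"Checker output:\n{feedback}"
        )
    return None
```

Returning `None` after `max_rounds` keeps the failure explicit, which matters here because the reported pass rates suggest that many specifications will still need expert review.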

Key insights

LLMs achieve only 8.6% semantic correctness for TLA+ specifications, and only under progressive prompting.

Principles

Method

Systematically evaluate LLM-generated TLA+ specifications using SANY for syntax and TLC for semantics, across four prompting strategies (Few-Shot, Progressive, Fill-in-Middle, Half Completion).
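The two-stage check described above can be sketched as a small harness. This is an illustration under stated assumptions rather than the study's tooling: it assumes `tla2tools.jar` is available locally, that a matching `.cfg` file sits next to the specification for TLC, and that both tools signal failure through a nonzero exit code (some tool versions may require inspecting the output instead). The `runner` parameter and the `Verdict` type are hypothetical names introduced here.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Runner = Callable[[Sequence[str]], int]


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd), capture_output=True, text=True).returncode


@dataclass
class Verdict:
    syntactic: bool  # SANY parsed the specification
    semantic: bool   # TLC model checking also succeeded


def check_spec(spec_path: str, jar: str = "tla2tools.jar",
               runner: Runner = run_command) -> Verdict:
    """Check syntax with SANY first, then semantics with TLC."""
    sany = ["java", "-cp", jar, "tla2sany.SANY", spec_path]
    # Assumption: a nonzero exit code means the parse failed.
    if runner(sany) != 0:
        return Verdict(syntactic=False, semantic=False)
    tlc = ["java", "-cp", jar, "tlc2.TLC", spec_path]
    # Assumption: exit code 0 means TLC reported no errors.
    return Verdict(syntactic=True, semantic=runner(tlc) == 0)


def pass_rates(results: Dict[str, List[Verdict]]) -> Dict[str, Tuple[float, float]]:
    """Map each prompting strategy to (syntactic rate, semantic rate)."""
    rates = {}
    for strategy, verdicts in results.items():
        n = len(verdicts)
        if n == 0:
            rates[strategy] = (0.0, 0.0)
            continue
        syntactic = sum(v.syntactic for v in verdicts) / n
        semantic = sum(v.semantic for v in verdicts) / n
        rates[strategy] = (syntactic, semantic)
    return rates
```

Running TLC only after SANY succeeds mirrors the syntax-then-semantics ordering, and grouping verdicts by strategy makes it straightforward to compare Few-Shot, Progressive, Fill-in-Middle, and Half Completion side by side.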

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.