Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

2026-06-04 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

A systematic evaluation assessed how well 30 large language models (LLMs) from eight families generate correct TLA+ specifications from natural language. The study, described as the first of its kind, used a curated dataset of 205 TLA+ specifications and validated 2,600 open-weight model runs and 130 proprietary model runs with the SANY parser and the TLC model checker. The models reached at most 26.6% syntactic correctness and only 8.6% semantic correctness, and successful generations occurred only under progressive prompting strategies. Model size did not correlate with quality: for instance, DeepSeek r1:8b outperformed its 70B variant. Code-specialized models consistently underperformed, which the study attributes to negative transfer. The research identified five recurring hallucination categories linked to training data biases and concluded that current LLMs require expert oversight for reliable TLA+ specification generation. The evaluation framework, code, and dataset are publicly released.

Key takeaway

Software engineers and AI scientists considering large language models for TLA+ specification generation should treat current LLMs as unreliable for producing semantically correct output: the best result in the study was 8.6% semantic correctness. Plan for significant expert oversight and manual correction. Prefer progressive prompting strategies, be wary of code-specialized models, which underperformed, and do not rely on LLMs alone for critical formal verification tasks.

Key insights

LLMs struggle significantly with semantic correctness in TLA+ specification generation, requiring expert oversight.

Principles

Model size does not predict formal language generation quality.
Reasoning alignment is crucial for formal language tasks.
Code-specialized models can suffer negative transfer for formal languages.

Method

The study systematically evaluated 30 LLMs on 205 TLA+ specifications across several prompting strategies, validating each generated specification with the SANY parser (syntactic correctness) and the TLC model checker (semantic correctness).
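
The two-stage validation described above can be sketched roughly as follows. This is a minimal illustration, not the study's released framework: the names RunResult, validate, and correctness_rates are hypothetical, the jar path is an assumed local copy of the TLA+ tools, and the sketch assumes that a zero exit status from each tool means the check passed and that a .cfg file sits next to each generated .tla file.

```python
from dataclasses import dataclass
from pathlib import Path
import subprocess

# Assumption: a local copy of the TLA+ tools jar; adjust the path as needed.
TLA_TOOLS_JAR = "tla2tools.jar"


@dataclass(frozen=True)
class RunResult:
    """Outcome of validating one generated specification (hypothetical type)."""
    parses: bool        # SANY accepted the spec (syntactic correctness)
    model_checks: bool  # TLC finished without errors (semantic correctness)


def sany_command(spec: Path, jar: str = TLA_TOOLS_JAR) -> list[str]:
    """Command line that runs the SANY parser on a .tla file."""
    return ["java", "-cp", jar, "tla2sany.SANY", str(spec)]


def tlc_command(spec: Path, jar: str = TLA_TOOLS_JAR) -> list[str]:
    """Command line that runs TLC, assuming a sibling .cfg file exists."""
    cfg = spec.with_suffix(".cfg")
    return ["java", "-cp", jar, "tlc2.TLC", "-config", str(cfg), str(spec)]


def run_tool(cmd: list[str], timeout_s: int = 300) -> bool:
    """Return True if the tool exits with status 0 (assumed to mean success)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return proc.returncode == 0


def validate(spec: Path) -> RunResult:
    """Parse first; only model-check specs that parse."""
    parses = run_tool(sany_command(spec))
    model_checks = parses and run_tool(tlc_command(spec))
    return RunResult(parses, model_checks)


def correctness_rates(results: list[RunResult]) -> tuple[float, float]:
    """Fractions of runs that are syntactically and semantically correct."""
    total = len(results)
    if total == 0:
        return (0.0, 0.0)
    syntactic = sum(r.parses for r in results) / total
    semantic = sum(r.model_checks for r in results) / total
    return (syntactic, semantic)
```

Gating TLC on a successful parse keeps the semantic rate from ever exceeding the syntactic rate, which matches the ordering of the reported figures. Depending on the tool version, exit status alone may not be a reliable signal, so a real harness might also inspect the tools' output.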

In practice

Use progressive prompting for TLA+ generation attempts.
Prioritize reasoning alignment over raw model size.
Avoid code-specialized LLMs for TLA+ tasks.

Topics

TLA+
Formal Verification
Large Language Models
Specification Synthesis
Semantic Correctness
Prompting Strategies

Best for: AI Scientist, Research Scientist, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.