TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation
Summary
TLA-Prover is a 20-billion-parameter model for synthesizing verifiable specifications in TLA+, a formal language used to specify distributed systems and safety-critical protocols. Existing large language models (LLMs) struggle with TLA+ generation: the best public baseline produces specifications that parse only 26.6% of the time and pass the model checker only 8.6% of the time. TLA-Prover addresses this by combining supervised fine-tuning (SFT) on verified examples with repair-based group-relative policy optimization (GRPO), in which the model learns to correct its own failed specifications using the TLC model checker as a direct reward signal. On a held-out 30-problem benchmark, TLA-Prover achieved a 30% pass@1 rate at both the Gold and Diamond tiers, roughly 3.5x the 8.6% untuned baseline. A direct preference optimization (DPO) variant reached 20% at Diamond. Both tiers are designed to ensure that the properties being verified are non-trivial.
Key takeaway
For AI engineers building formal verification tools or specifying safety-critical systems, TLA-Prover shows a viable way to improve LLM-generated TLA+ specifications. Consider adding repair-based policy optimization and direct model checker feedback to your training pipeline. In this work, that combination raised the rate of verifiable specifications to about 3.5x the untuned baseline, which suggests it can address the semantic correctness problems of formal language synthesis and reduce manual verification effort.
Key insights
Combining SFT with repair-based policy optimization significantly improves the rate at which synthesized formal specifications pass verification.
Principles
- Direct model checker feedback enhances formal language generation.
- Learning to self-repair improves semantic correctness.
- Non-trivial property verification is crucial for robust evaluation.
Method
TLA-Prover is first trained with supervised fine-tuning. It then applies repair-based group-relative policy optimization (GRPO), in which the model repairs its own TLC-rejected specifications and the TLC verdict serves directly as the reward.
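The group-relative part of GRPO can be illustrated with a minimal sketch. It assumes the commonly used formulation in which each sampled attempt's reward is normalized against the mean and standard deviation of its own group. The reward values and the function name `group_relative_advantages` are hypothetical and are not taken from the original article.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every attempt gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four repair attempts at one TLC-rejected spec
# (assumed scale: 0.0 = does not parse, 0.5 = parses but TLC reports an
# error, 1.0 = TLC passes).
rewards = [0.0, 0.5, 1.0, 0.5]
advantages = group_relative_advantages(rewards)
```

Attempts that score above their group's average receive a positive advantage and are reinforced, while below-average attempts are pushed down. No separate value model is needed, which is the usual motivation for the group-relative form.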
In practice
- Apply GRPO for formal language generation tasks.
- Use formal verification tools like TLC as a direct reward signal.
- Implement multi-tier evaluation for semantic correctness.
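As a rough sketch of the second point, the snippet below turns a TLC run into a binary reward. It assumes a local `tla2tools.jar` and a `.cfg` file next to each specification. The helper names (`run_tlc`, `reward_from_tlc_output`, `pass_rate`) are hypothetical, and the article's Gold and Diamond tier criteria are not defined in this summary, so they are left out rather than guessed.

```python
import subprocess
from pathlib import Path

# Line TLC prints when a run finishes without finding a violation.
TLC_SUCCESS = "Model checking completed. No error has been found."


def reward_from_tlc_output(output: str) -> float:
    """Binary reward: 1.0 only if TLC reported a clean run."""
    return 1.0 if TLC_SUCCESS in output else 0.0


def run_tlc(spec: Path, jar: str = "tla2tools.jar", timeout_s: int = 60) -> str:
    """Run TLC on `spec` from its own directory and return the combined output."""
    try:
        proc = subprocess.run(
            ["java", "-cp", str(Path(jar).resolve()), "tlc2.TLC", spec.stem],
            cwd=spec.parent,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return ""  # An empty output is scored as a failure.
    return proc.stdout + proc.stderr


def pass_rate(rewards: list[float]) -> float:
    """Fraction of problems whose single attempt passed (pass@1)."""
    return sum(r == 1.0 for r in rewards) / len(rewards)
```

A clean TLC run alone does not show that the checked properties are meaningful, since a specification with no real invariants can pass trivially. A tiered evaluation would add stricter checks on top of this binary signal.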
Topics
- TLA+
- Formal Verification
- LLM Fine-tuning
- Policy Optimization
- Specification Synthesis
- Model Checking
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.