TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation
Summary
TLA-Prover is a 20-billion-parameter model, built on gpt-oss-20b, that synthesizes verifiable TLA+ specifications for distributed systems. Untuned LLMs pass semantic model checking on only 8.6% of problems. To improve on that baseline, TLA-Prover uses a two-stage training pipeline: supervised fine-tuning (SFT) on Diamond-tier curated examples, followed by repair-based group-relative policy optimization (GRPO). A four-tier validation hierarchy, culminating in the Diamond tier, requires generated invariants to be mutation-sensitive, which blocks reward hacking through tautological properties. On a 30-problem held-out benchmark, TLA-Prover achieves a 30% pass rate at both the Gold and Diamond tiers, roughly a 3.5x improvement over the baseline. The result supports verifier-guided training combined with explicit anti-reward-hacking checks.
Key takeaway
AI engineers building LLMs for formal specification synthesis should integrate mutation-sensitive validation into their pipelines. Relying solely on a model-checker pass risks producing vacuous, tautological invariants that express no useful protocol properties. A "Diamond" tier check, which requires invariants to be mutation-sensitive, helps ensure that generated specifications are meaningful and makes the resulting verifiable code generation more trustworthy.
Key insights
Verifier-guided LLM training for formal specifications necessitates mutation-sensitive validation to prevent reward hacking.
Principles
- A TLC pass alone is an exploitable reward signal.
- Mutation testing prevents tautological invariants.
- Diamond-tier data curation enhances SFT quality.
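The first two principles can be illustrated with a toy sketch. It is not the paper's implementation and does not call TLC: a small exhaustive state explorer stands in for the model checker, and the counter system, the mutant, and every function name are hypothetical. The idea it shows is that a tautological invariant passes on the original system and on every mutant, so it is not mutation-sensitive, while a meaningful invariant fails on at least one mutant.

```python
from typing import Callable, Iterable, List, Set

State = int
NextFn = Callable[[State], Iterable[State]]
Invariant = Callable[[State], bool]


def reachable(init: State, next_fn: NextFn) -> Set[State]:
    """Exhaustively explore a finite state space (toy stand-in for a model checker)."""
    seen, frontier = {init}, [init]
    while frontier:
        state = frontier.pop()
        for succ in next_fn(state):
            if succ not in seen:
                seen.add(succ)
                frontier.append(succ)
    return seen


def holds(inv: Invariant, init: State, next_fn: NextFn) -> bool:
    """True if the invariant holds in every reachable state."""
    return all(inv(s) for s in reachable(init, next_fn))


def mutation_sensitive(inv: Invariant, init: State,
                       next_fn: NextFn, mutants: List[NextFn]) -> bool:
    """True if inv holds on the original system and fails on at least one mutant."""
    if not holds(inv, init, next_fn):
        return False
    return any(not holds(inv, init, m) for m in mutants)


# Hypothetical system: a counter that increments until it reaches 3.
def spec_next(x: State) -> List[State]:
    return [x + 1] if x < 3 else []


# Hypothetical mutant: the upper bound is loosened from 3 to 5.
def mutant_next(x: State) -> List[State]:
    return [x + 1] if x < 5 else []


def tautology(x: State) -> bool:
    return x == x  # vacuous: true in every state of every system


def bounded(x: State) -> bool:
    return 0 <= x <= 3  # says something about the protocol


print(mutation_sensitive(tautology, 0, spec_next, [mutant_next]))
print(mutation_sensitive(bounded, 0, spec_next, [mutant_next]))
```

A check that only asks whether the invariant holds on the original system would accept both invariants; the mutant is what separates them.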
Method
Two-stage pipeline: supervised fine-tuning (SFT) on Diamond-tier curated data, then repair-based group-relative policy optimization (GRPO), with TLC supplying a continuous, dense reward signal graded by a four-tier validation hierarchy.
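As a rough sketch of how tiered validation results could feed a group-relative update, the snippet below maps each sampled specification's tier to a scalar reward and normalizes rewards within the sampling group, which is the standard GRPO advantage computation. The summary names only the Gold and Diamond tiers, so the labels `tier_1` and `tier_2`, the placement of Gold directly below Diamond, and all reward values are assumptions made for illustration, not figures from the paper.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical tier labels and reward values; only "gold" and "diamond"
# are named in the summary, and the numbers are illustrative.
TIER_REWARD: Dict[str, float] = {
    "fail": 0.0,
    "tier_1": 0.25,
    "tier_2": 0.5,
    "gold": 0.75,
    "diamond": 1.0,
}


def group_relative_advantages(tiers: List[str], eps: float = 1e-8) -> List[float]:
    """Normalize tier rewards within one sampling group (GRPO-style advantages)."""
    rewards = [TIER_REWARD[t] for t in tiers]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four candidate specs sampled for the same problem.
group = ["fail", "gold", "diamond", "fail"]
for tier, adv in zip(group, group_relative_advantages(group)):
    print(f"{tier:8s} advantage={adv:+.3f}")
```

Because advantages are relative to the group, a group in which every candidate reaches the same tier yields zero advantage for all of them, so graded tiers give the optimizer more signal than a single pass/fail bit.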
In practice
- Always run mutation testing on TLC-passing specs.
- Use best-of-K sampling for candidate specs.
- Prefer Diamond-tier training data.
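The first two practices above can be combined in a best-of-K loop, sketched below under stated assumptions. `generate` and `validate` are hypothetical callables standing in for the model and for a validator that runs TLC plus mutation testing; the tier labels other than Gold and Diamond are placeholders, as is the ordering of Gold directly below Diamond.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical ordering of validation outcomes, lowest to highest.
TIER_RANK: Dict[str, int] = {
    "fail": 0, "tier_1": 1, "tier_2": 2, "gold": 3, "diamond": 4,
}


def best_of_k(generate: Callable[[], str],
              validate: Callable[[str], str],
              k: int) -> Tuple[Optional[str], str]:
    """Sample up to k candidate specs and keep the one with the highest tier.

    Stops early once a candidate reaches the top tier.
    Returns (None, "fail") if no candidate passes any tier.
    """
    best_spec: Optional[str] = None
    best_tier = "fail"
    for _ in range(k):
        spec = generate()
        tier = validate(spec)  # assumed to include mutation testing
        if TIER_RANK[tier] > TIER_RANK[best_tier]:
            best_spec, best_tier = spec, tier
        if best_tier == "diamond":
            break
    return best_spec, best_tier


# Usage with stub callables in place of a real model and validator.
candidates = iter(["spec_a", "spec_b", "spec_c", "spec_d"])
stub_tiers = {"spec_a": "fail", "spec_b": "gold",
              "spec_c": "diamond", "spec_d": "gold"}
spec, tier = best_of_k(lambda: next(candidates), stub_tiers.__getitem__, k=4)
print(spec, tier)
```

Ranking by tier rather than by a plain pass/fail flag means a candidate that merely passes the model checker never displaces one that also survived mutation testing.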
Topics
- TLA+
- Formal Verification
- Large Language Models
- Specification Synthesis
- Reward Hacking
- Mutation Testing
- Preference Optimization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.