TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation

2026-06-05 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

TLA-Prover is a 20-billion-parameter model, built on gpt-oss-20b, that synthesizes verifiable TLA+ specifications for distributed systems. Untuned LLMs reach only an 8.6% semantic model-check pass rate, so TLA-Prover is trained with a two-stage pipeline: supervised fine-tuning (SFT) on Diamond-tier curated examples, followed by repair-based group-relative policy optimization (GRPO). A four-tier validation hierarchy, culminating in the Diamond tier, requires generated invariants to be mutation-sensitive, which blocks reward hacking through tautological properties. On a 30-problem held-out benchmark, TLA-Prover achieves a 30% pass rate at both the Gold and Diamond tiers, roughly a 3.5x improvement over the 8.6% baseline. The result supports verifier-guided training combined with anti-reward-hacking checks.

Key takeaway

AI engineers developing LLMs for formal specification synthesis should integrate mutation-sensitive validation into their pipelines. Relying solely on a passing model-checker run risks generating vacuous, tautological invariants that convey no useful protocol properties. A "Diamond" tier check confirms that each generated invariant actually fails when the specification is mutated, which makes verifiable code generation more trustworthy.

Key insights

Verifier-guided LLM training for formal specifications necessitates mutation-sensitive validation to prevent reward hacking.

Principles

TLC pass alone is an exploitable reward signal.
Mutation testing prevents tautological invariants.
Diamond-tier data curation improves SFT quality.
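
To illustrate why a model-checker pass alone can be gamed, here is a minimal Python sketch of a mutation-sensitivity check. It is not the paper's implementation and it does not call TLC: the toy counter system, the two mutants, and the function names are all hypothetical stand-ins. An invariant counts as mutation-sensitive here if it holds on the original system and is violated on at least one mutant.

```python
from typing import Callable, Iterable, List, Set

State = int
NextFn = Callable[[State], Iterable[State]]
Invariant = Callable[[State], bool]


def reachable(init: State, next_fn: NextFn, limit: int = 1000) -> Set[State]:
    """Breadth-first exploration of reachable states, capped near `limit`."""
    seen = {init}
    frontier = [init]
    while frontier and len(seen) < limit:
        new_frontier = []
        for s in frontier:
            for t in next_fn(s):
                if t not in seen:
                    seen.add(t)
                    new_frontier.append(t)
        frontier = new_frontier
    return seen


def holds(inv: Invariant, init: State, next_fn: NextFn) -> bool:
    """True if the invariant holds in every reachable state."""
    return all(inv(s) for s in reachable(init, next_fn))


# Hypothetical original system: a counter that wraps modulo 4.
def next_ok(s: State) -> List[State]:
    return [(s + 1) % 4]


# Hypothetical mutants: small, deliberate faults in the transition relation.
def next_no_wrap(s: State) -> List[State]:
    return [s + 1] if s < 10 else []  # wrap-around removed


def next_skip(s: State) -> List[State]:
    return [(s + 2) % 8]  # wrong step and wrong modulus


MUTANTS: List[NextFn] = [next_no_wrap, next_skip]


def is_mutation_sensitive(inv: Invariant, init: State = 0) -> bool:
    """Holds on the original AND is violated by at least one mutant."""
    if not holds(inv, init, next_ok):
        return False
    return any(not holds(inv, init, m) for m in MUTANTS)


def bounded(s: State) -> bool:
    return 0 <= s < 4  # a meaningful property of the counter


def tautology(s: State) -> bool:
    return s == s  # always true, says nothing about the system


print(is_mutation_sensitive(bounded))
print(is_mutation_sensitive(tautology))
```

Both invariants pass on the original system, so a reward based only on that pass cannot tell them apart. Only the mutant runs expose the tautology as vacuous.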

Method

Two-stage pipeline: SFT on Diamond-tier curated data, then repair-based GRPO using TLC as a continuous, dense reward signal via a four-tier validation hierarchy.
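
The group-relative step of GRPO can be sketched in a few lines of Python. This shows the commonly used formulation (each sample's reward minus the group mean, divided by the group standard deviation) and not necessarily the exact variant used for TLA-Prover. The tier-to-reward mapping is an assumption made for illustration: the summary does not give reward values, so tiers are numbered 0 (fails every check) to 4 (Diamond).

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical reward per validation tier level; values are illustrative only.
TIER_REWARD: Dict[int, float] = {0: 0.0, 1: 0.25, 2: 0.5, 3: 0.75, 4: 1.0}


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group: four candidate specs for the same problem, scored by tier.
tiers = [0, 1, 4, 1]
rewards = [TIER_REWARD[t] for t in tiers]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because advantages are relative to the group, a graded tier reward still separates candidates when none of them reaches the top tier, which is one reason a dense signal helps more than a binary pass/fail.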

In practice

Always run mutation testing on TLC-passing specs.
Use best-of-K sampling for candidate specs.
Prefer Diamond-tier training data.
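
A minimal sketch of best-of-K selection, assuming a generator that returns candidate specifications and a scorer that returns the highest validation tier each candidate reaches. Both `stub_generate` and `stub_score` are hypothetical placeholders: in a real pipeline the first would sample from the model and the second would run TLC plus mutation testing.

```python
from typing import Callable, Dict, List, Tuple


def best_of_k(
    generate: Callable[[int], str],
    score: Callable[[str], int],
    k: int,
) -> Tuple[str, int]:
    """Draw k candidates and return the one with the highest tier score."""
    if k < 1:
        raise ValueError("k must be at least 1")
    candidates = [generate(i) for i in range(k)]
    scored = [(score(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Hypothetical stand-ins for model sampling and tiered validation.
POOL: List[str] = ["spec_a", "spec_b", "spec_c", "spec_d"]
TIER_OF: Dict[str, int] = {"spec_a": 0, "spec_b": 2, "spec_c": 4, "spec_d": 1}


def stub_generate(i: int) -> str:
    return POOL[i % len(POOL)]


def stub_score(spec: str) -> int:
    return TIER_OF[spec]


print(best_of_k(stub_generate, stub_score, k=4))
```

Ranking by tier rather than by a plain TLC pass means a candidate with a tautological invariant cannot beat one whose invariant survives mutation testing.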

Topics

TLA+
Formal Verification
Large Language Models
Specification Synthesis
Reward Hacking
Mutation Testing
Preference Optimization

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.