FVSpec: Real-World Property-Based Tests as Lean Challenges

2026-05-31 · Source: Artificial Intelligence · Field: Technology & Digital — Software Development & Engineering, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

FVSpec is a new benchmark for evaluating AI models and agents on real-world formal software verification tasks. The researchers scraped 11,039 property-based tests (PBTs) from Python repositories and automatically translated 2,772 of them (25%) into 9,415 Lean 4 specifications, keeping multiple formalization attempts per PBT. The translation is hard for three reasons: Python semantics must be modeled in Lean, logical properties must be inferred from imperative test code, and the output must type-check in Lean 4's dependently typed language. A three-agent LLM pipeline transpiles the PBTs into Lean specifications, and the benchmark reports coverage and quality metrics alongside proof-generation baselines for automated and model-based approaches. All code and data are open source. The stated goal is to advance AI-assisted formal verification of real-world software, which matters more as AI generates a growing share of code.
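The headline figures can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that uses only the three counts quoted in the summary; the variable names are illustrative and do not come from the FVSpec codebase.

```python
# Counts as quoted in the summary above.
scraped_pbts = 11_039      # PBTs scraped from Python repositories
translated_pbts = 2_772    # PBTs with at least one Lean 4 formalization
lean_specs = 9_415         # Lean 4 specifications retained in total

# Share of scraped PBTs that were translated.
coverage = translated_pbts / scraped_pbts
# Average number of formalization attempts kept per translated PBT.
specs_per_pbt = lean_specs / translated_pbts

print(f"coverage: {coverage:.1%}")                          # coverage: 25.1%
print(f"specs per translated PBT: {specs_per_pbt:.2f}")     # specs per translated PBT: 3.40
```

The second ratio shows what "retaining multiple formalization attempts" means in practice: roughly three to four Lean specifications per translated test on average.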

Key takeaway

For AI engineers who build or evaluate models for software verification, FVSpec is a benchmark grounded in real-world code rather than synthetic exercises. Use the open-source dataset and evaluation framework to test how well your models translate imperative test code into formal specifications and how well they generate proofs. Both abilities matter more as AI produces a growing share of production code.

Key insights

FVSpec benchmarks AI for formal software verification: of 11,039 scraped Python property-based tests, 2,772 were translated into 9,415 Lean 4 specifications.

Principles

Real-world formal verification is an underexplored AI challenge.
Translating imperative PBTs to formal specs is complex.
Retain multiple formalizations for quality metrics.
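To make "imperative PBT" concrete, here is a minimal sketch of a property-based test written with only the standard library. It is not taken from the FVSpec dataset, and the function name, trial count, and value ranges are assumptions; real repositories typically use a dedicated PBT library for input generation. The logical properties (idempotence, length preservation, ordering) exist only implicitly, as assertions inside a loop over random inputs, which is what a translator has to recover.

```python
import random


def check_sort_properties(trials: int = 200, seed: int = 0) -> int:
    """Check three properties of sorted() on randomly generated integer lists."""
    rng = random.Random(seed)  # seeded so a failure can be reproduced
    for _ in range(trials):
        xs = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 20))]
        ys = sorted(xs)
        assert sorted(ys) == ys                          # sorting is idempotent
        assert len(ys) == len(xs)                        # length is preserved
        assert all(a <= b for a, b in zip(ys, ys[1:]))   # output is ordered
    return trials


print(check_sort_properties())  # 200
```

A test like this samples the input space and can only find counterexamples. A formal specification states the same properties for all inputs.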

Method

The researchers scraped 11,039 PBTs from Python repositories. A three-agent LLM pipeline then automatically transpiled 2,772 of them into 9,415 Lean 4 specifications, inferring the logical property each test checks and modeling the relevant Python semantics in Lean 4.
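As an illustration of what such a translation might produce, the sketch below pairs a hypothetical Python PBT (in a comment) with a Lean 4 statement of the same property. It is not an entry from the dataset, and the theorem name and the choice of `Int` as the model for Python integers are assumptions. The proof is left as `sorry`, since in a challenge of this kind the statement is the specification and the proof is what a model is asked to supply.

```lean
-- Hypothetical Python PBT, shown only for context:
--   @given(st.lists(st.integers()))
--   def test_reverse_twice(xs):
--       assert list(reversed(list(reversed(xs)))) == xs

-- Python integers are unbounded, so `Int` is a closer model than a
-- fixed-width type. The random-input loop becomes a universal quantifier.
theorem reverse_twice (xs : List Int) : xs.reverse.reverse = xs := by
  sorry -- proof intentionally left open: this is the challenge
```

Even in this small case the translator has to choose a Lean type for the Python values and turn an assertion over sampled inputs into a claim about every list.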

In practice

Evaluate AI models using the FVSpec benchmark.
Utilize open-source PBTs and Lean specifications.

Topics

FVSpec
Formal Verification
Property-Based Tests
Lean 4
LLM Pipeline
AI-assisted Software Engineering

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.