AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms
Summary
AlgoVeri is a benchmark that evaluates how well Large Language Models (LLMs) generate formally verified code, a task known as vericoding, on 77 classical algorithms. Unlike previous benchmarks, AlgoVeri enforces identical functional contracts across Dafny, Verus, and Lean, enabling a fair cross-paradigm comparison. Frontier models such as Gemini-3 Flash reach 40.3% success in Dafny, 24.7% in Verus, and 7.8% in Lean, which points to significant capability gaps between verification systems. Frontier models also keep improving through iterative repair across 15 rounds, whereas open models such as GPT-OSS-120B saturate early. Error analysis indicates that language design shapes repair trajectories: Dafny lets models focus on logical correctness, while Verus and Lean present persistent syntactic and semantic barriers.
Key takeaway
Machine Learning Engineers building vericoding solutions should account for the distinct challenges posed by each formal verification system. With current LLMs, SMT-based tools such as Dafny give higher success rates, especially on complex algorithms. For frontier models, implement iterative repair to exploit their self-correction capabilities. For open-weight models, allocate compute to parallel sampling rather than deep repair, because they saturate quickly. Together, these choices aim to improve verification success and to work around language-specific barriers.
Key insights
LLM vericoding performance varies significantly across formal verification systems and with problem complexity, and frontier models benefit most from iterative repair.
Principles
- Aligned specifications are crucial for fair vericoding benchmark comparisons.
- SMT-based verifiers such as Dafny yield higher LLM success rates than interactive theorem provers (ITPs) such as Lean.
- Iterative repair pays off for frontier LLMs, while open models saturate after a few rounds.
Method
AlgoVeri evaluates LLMs by prompting for algorithm implementation and proof artifacts, using multi-turn refinement with compiler feedback, followed by semantic validation.
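The multi-turn refinement described above can be sketched as a simple loop. This is a minimal illustration, not AlgoVeri's actual harness: `generate` and `verify` are hypothetical callables standing in for the model call and for a wrapper around Dafny, Verus, or Lean, and the prompt wording is an assumption. The default of 15 rounds mirrors the repair budget mentioned in the summary. The semantic validation step that follows refinement is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class VerifyResult:
    ok: bool   # True when the verifier accepts the candidate
    log: str   # compiler/verifier feedback to show the model


def repair_loop(
    task: str,
    generate: Callable[[str], str],         # hypothetical: prompt -> candidate program
    verify: Callable[[str], VerifyResult],  # hypothetical: wraps the verifier toolchain
    max_rounds: int = 15,
) -> Tuple[Optional[str], int]:
    """Return (verified program or None, number of rounds used)."""
    prompt = task
    for round_no in range(1, max_rounds + 1):
        candidate = generate(prompt)
        result = verify(candidate)
        if result.ok:
            return candidate, round_no
        # Feed the failing attempt and the verifier log back to the model.
        prompt = (
            f"{task}\n\nPrevious attempt:\n{candidate}\n\n"
            f"Verifier feedback:\n{result.log}\n\nFix the errors."
        )
    return None, max_rounds
```

Recording the round at which each task first verifies makes it possible to see whether a model keeps improving in later rounds or saturates early.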
In practice
- Prioritize Dafny for LLM-assisted vericoding of complex algorithms.
- Implement iterative repair loops for frontier LLMs to maximize verification success.
- For open LLMs, prefer parallel sampling over deep iterative repair.
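One way to act on the last two recommendations is to split a fixed budget of model calls between independent samples (width) and repair rounds per sample (depth). The sketch below is only an illustration under assumed numbers: `plan_budget` is a hypothetical helper, the deep setting of 15 rounds follows the summary, and the shallow setting of 2 rounds is an arbitrary placeholder that would need tuning per model.

```python
from typing import Tuple


def plan_budget(
    total_calls: int,
    saturates_early: bool,
    deep_rounds: int = 15,     # repair depth for models that keep improving
    shallow_rounds: int = 2,   # assumed depth for models that saturate early
) -> Tuple[int, int]:
    """Return (parallel_samples, rounds_per_sample) for a fixed call budget."""
    if total_calls < 1:
        raise ValueError("total_calls must be at least 1")
    rounds = shallow_rounds if saturates_early else deep_rounds
    rounds = min(rounds, total_calls)  # never plan more rounds than calls
    samples = total_calls // rounds    # spend the rest on independent samples
    return samples, rounds


# Same 30-call budget, two different shapes.
open_model_plan = plan_budget(30, saturates_early=True)
frontier_plan = plan_budget(30, saturates_early=False)
```

With a 30-call budget, the early-saturating setting yields many shallow attempts, while the other setting yields a few deep repair chains.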
Topics
- Verified Code Generation
- LLM Benchmarking
- Formal Verification
- Dafny
- Verus
- Lean
- Iterative Repair
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.