AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms

2025-08-07 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

AlgoVeri is a new benchmark that evaluates how well Large Language Models (LLMs) generate formally verified code, a task known as vericoding, on 77 classical algorithms. Unlike previous benchmarks, AlgoVeri enforces identical functional contracts across Dafny, Verus, and Lean, enabling a fair cross-paradigm comparison. Frontier models like Gemini-3 Flash achieve 40.3% success in Dafny, 24.7% in Verus, and 7.8% in Lean, a wide capability gap between the three systems. Frontier models also use iterative repair effectively, continuing to improve over 15 rounds, while open models like GPT-OSS-120B saturate early. Error analysis indicates that language design shapes repair trajectories: Dafny lets models focus on logical correctness, while Verus and Lean present persistent syntactic and semantic barriers.

Key takeaway

Machine Learning Engineers developing vericoding solutions should recognize that different formal verification systems pose distinct challenges. Prioritize SMT-based tools like Dafny for higher success rates with current LLMs, especially on complex algorithms. With frontier models, implement iterative repair to leverage their self-correction capabilities. With open-weight models, allocate compute to parallel sampling rather than deep repair, as they saturate quickly. Matching the strategy to the model and the verifier in this way should improve verification success and help with language-specific barriers.

Key insights

LLM vericoding performance varies significantly across formal verification systems and problem complexity, and frontier models benefit most from iterative repair.

Principles

Aligned specifications are crucial for fair vericoding benchmark comparisons.
SMT-based verifiers such as Dafny yield higher LLM success rates than interactive theorem provers (ITPs) such as Lean.
Effective iterative repair is an emergent capability of frontier LLMs; open models saturate early.

Method

AlgoVeri evaluates LLMs by prompting for algorithm implementation and proof artifacts, using multi-turn refinement with compiler feedback, followed by semantic validation.
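
The multi-turn refinement step can be pictured as a loop that feeds verifier output back to the model until the program verifies or the round limit is reached. The following is a minimal sketch under stated assumptions, not the benchmark's actual harness: `repair_loop`, `VerifyResult`, and the two toy functions are hypothetical names, the model and verifier are passed in as plain callables, and the semantic validation stage mentioned above is not modeled. The default of 15 rounds follows the summary.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class VerifyResult:
    ok: bool
    feedback: str  # verifier/compiler output to hand back to the model


def repair_loop(
    task: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    verify: Callable[[str], VerifyResult],
    max_rounds: int = 15,
) -> Tuple[Optional[str], int]:
    """Return (verified_program, rounds_used), or (None, max_rounds) on failure."""
    candidate: Optional[str] = None
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        # Round 1 sees only the task; later rounds also see the last attempt
        # and the verifier's complaint about it.
        candidate = generate(task, candidate, feedback)
        result = verify(candidate)
        if result.ok:
            return candidate, round_no
        feedback = result.feedback
    return None, max_rounds


# Toy stand-ins (assumptions for illustration only, not real LLM or Dafny calls).
def toy_generate(task: str, previous: Optional[str], feedback: Optional[str]) -> str:
    if feedback is None:
        return f"method {task}() {{ while ... }}"
    return previous + " // invariant added"


def toy_verify(program: str) -> VerifyResult:
    if "invariant added" in program:
        return VerifyResult(True, "")
    return VerifyResult(False, "loop invariant might not hold")


program, rounds = repair_loop("Sum", toy_generate, toy_verify)
```

In a real setup, `verify` would wrap a call to the Dafny, Verus, or Lean toolchain and `generate` would wrap a model API call; both are left abstract here.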

In practice

Prioritize Dafny for LLM-assisted vericoding of complex algorithms.
Implement iterative repair loops for frontier LLMs to maximize verification success.
For open LLMs, prefer parallel sampling over deep iterative repair.
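
The trade-off between deep repair and parallel sampling can be framed as how a fixed budget of model calls is split into independent chains. The sketch below is illustrative only: `plan_chains`, `run_with_budget`, and the toy model are hypothetical, and the chains run one after another here although they are independent and could run concurrently. A `depth` of 1 corresponds to pure parallel sampling; a large `depth` corresponds to deep iterative repair.

```python
from typing import Callable, List, Optional, Tuple

# (chain_id, previous_program, feedback) -> new candidate program
Generate = Callable[[int, Optional[str], Optional[str]], str]
# program -> (verified?, feedback)
Verify = Callable[[str], Tuple[bool, str]]


def plan_chains(budget: int, depth: int) -> List[int]:
    """Split `budget` model calls into chains of at most `depth` rounds each."""
    if budget < 0 or depth < 1:
        raise ValueError("budget must be >= 0 and depth must be >= 1")
    full, rest = divmod(budget, depth)
    return [depth] * full + ([rest] if rest else [])


def run_with_budget(
    generate: Generate, verify: Verify, budget: int, depth: int
) -> Optional[str]:
    """Return the first verified program found within the budget, else None."""
    for chain_id, rounds in enumerate(plan_chains(budget, depth)):
        previous: Optional[str] = None
        feedback: Optional[str] = None
        for _ in range(rounds):
            previous = generate(chain_id, previous, feedback)
            ok, feedback = verify(previous)
            if ok:
                return previous
    return None


# Toy model that ignores feedback entirely (an assumed stand-in for a model
# that saturates early): only the sample drawn by chain 3 happens to verify.
def toy_generate(chain_id: int, previous: Optional[str], feedback: Optional[str]) -> str:
    return f"candidate-{chain_id}"


def toy_verify(program: str) -> Tuple[bool, str]:
    ok = program == "candidate-3"
    return ok, "" if ok else "postcondition might not hold"


wide = run_with_budget(toy_generate, toy_verify, budget=8, depth=1)
deep = run_with_budget(toy_generate, toy_verify, budget=8, depth=4)
```

With this toy model, the same budget of 8 calls finds a verified program when spent on 8 single-round chains and finds none when spent on 2 chains of 4 rounds, since extra rounds add nothing for a model that does not use feedback. A model that does exploit feedback would shift the balance toward larger `depth`.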

Topics

Verified Code Generation
LLM Benchmarking
Formal Verification
Dafny
Verus
Lean
Iterative Repair

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.