AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Mathematics & Computational Sciences) · Depth: Expert, extended

Summary

AlgoVeri is a new benchmark that evaluates how well large language models (LLMs) generate formally verified code, a task known as vericoding, on 77 classical algorithms. Unlike previous benchmarks, AlgoVeri enforces identical functional contracts across Dafny, Verus, and Lean, which enables a fair cross-paradigm comparison. Frontier models such as Gemini-3 Flash reach 40.3% success in Dafny, 24.7% in Verus, and 7.8% in Lean, highlighting significant capability gaps across the three systems. The benchmark also shows that frontier models keep improving through iterative repair over 15 rounds, while open models such as GPT-OSS-120B saturate early. Error analysis indicates that language design shapes repair trajectories: Dafny lets models focus on logical correctness, whereas Verus and Lean present persistent syntactic and semantic barriers.

Key takeaway

Machine learning engineers building vericoding systems should account for the distinct challenges that each formal verification system poses. Prioritize Dafny, an SMT-based verifier, for higher success rates with current LLMs, especially on complex algorithms. For frontier models, implement iterative repair to exploit their self-correction ability. For open-weight models, allocate compute to parallel sampling rather than deep repair, because they saturate quickly. Matching the strategy to the model class and the target language in this way addresses the language-specific barriers each verifier presents.
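The trade-off between deep repair and parallel sampling can be sketched as a budget split. This is a minimal illustration, not something the benchmark prescribes: `allocate_budget`, `any_chain_succeeds`, and the `saturation_round` parameter are hypothetical names, and the saturation point of a given model is assumed to have been measured separately.

```python
from typing import Tuple


def allocate_budget(total_calls: int, saturation_round: int) -> Tuple[int, int]:
    """Split a fixed number of model calls into (parallel_chains, rounds_per_chain).

    saturation_round is a hypothetical, empirically measured round after which
    further repair turns stop helping a given model.
    """
    if total_calls < 1 or saturation_round < 1:
        raise ValueError("total_calls and saturation_round must be positive")
    # Never repair deeper than the point where the model stops improving.
    rounds_per_chain = min(saturation_round, total_calls)
    # Spend whatever budget remains on independent chains.
    parallel_chains = total_calls // rounds_per_chain
    return parallel_chains, rounds_per_chain


def any_chain_succeeds(p_chain: float, chains: int) -> float:
    """Probability that at least one chain verifies, assuming independent chains."""
    return 1.0 - (1.0 - p_chain) ** chains


if __name__ == "__main__":
    # A model that keeps improving through round 15: one deep chain.
    print(allocate_budget(15, 15))
    # A model assumed (for illustration only) to saturate at round 3.
    print(allocate_budget(15, 3))
```

With a budget of 15 calls, a model that improves through all 15 rounds gets a single deep chain, while a model assumed to saturate at round 3 gets five shorter chains. The independence assumption in `any_chain_succeeds` is an idealization; samples from the same model on the same task are usually correlated.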

Key insights

LLM vericoding performance varies significantly across formal verification systems and with problem complexity, and frontier models benefit most from iterative repair.

Method

AlgoVeri evaluates LLMs by prompting them for an algorithm implementation together with its proof artifacts, refining the output over multiple turns using compiler feedback, and then applying semantic validation.
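The evaluation flow described above can be sketched as a loop. This is an illustrative outline under stated assumptions, not AlgoVeri's actual harness: `generate`, `verify`, and `validate` are hypothetical callables standing in for the model call, the Dafny/Verus/Lean toolchain, and the semantic check, and `VerifierResult` is an invented container.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class VerifierResult:
    ok: bool           # True if the verifier accepted the candidate
    diagnostics: str   # compiler/verifier messages fed back to the model


def refine(
    task: str,
    generate: Callable[[str, Optional[str]], str],  # hypothetical model call
    verify: Callable[[str], VerifierResult],        # hypothetical verifier wrapper
    validate: Callable[[str], bool],                # hypothetical semantic check
    max_rounds: int = 15,
) -> Tuple[Optional[str], int]:
    """Return (accepted_candidate_or_None, rounds_used)."""
    feedback: Optional[str] = None
    for round_idx in range(1, max_rounds + 1):
        candidate = generate(task, feedback)
        result = verify(candidate)
        if result.ok:
            # Assumption: semantic validation runs once, after the verifier accepts.
            return (candidate if validate(candidate) else None), round_idx
        # Otherwise pass the diagnostics into the next turn.
        feedback = result.diagnostics
    return None, max_rounds
```

A caller supplies the three callables, for example a `verify` that shells out to the chosen verifier and captures its diagnostics. Whether a semantic-validation failure should trigger further repair turns is a design choice this sketch leaves out.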

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.