LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences, Software Development & Engineering · Depth: Expert, extended

Summary

LeanMarathon is a multi-agent harness for reliable, long-horizon autoformalization of research mathematics in Lean 4. It addresses failure modes of large-scale formalization, such as statement drift and dependency tangles, with an "evolving blueprint" that serves at once as a formal proof skeleton, a natural-language proof graph, and a shared system of record. Four contract-scoped agents (Blueprinter, Target-Reviewer, Worker, and Refiner) collaborate under a two-stage orchestrator that first stabilizes target fidelity through adversarial review, then discharges the proof DAG from its dynamic leaves upward in parallel, CI-gated rounds. The system formalized all seven target theorems across two research papers covering four Erdős problems (#1051, #1196, #164, #1217), proving 258 lemmas and theorems. Total costs ranged from $189 to $624 per run, and the harness significantly outperformed a commercial single-agent baseline.
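To make the idea of a "formal proof skeleton" concrete, here is a minimal Lean 4 sketch. It is not taken from the paper: the theorem names `leaf_add_zero` and `target_double` and the statements themselves are hypothetical, chosen only to show a fixed target that depends on a still-open leaf lemma marked with `sorry`.

```lean
-- Hypothetical blueprint fragment (not from the paper).
-- The statements are fixed; the open leaf is marked with `sorry`.

-- Leaf lemma: statement agreed on, proof still to be discharged.
theorem leaf_add_zero (n : Nat) : n + 0 = n := by
  sorry

-- Target theorem: already proved relative to the leaf above,
-- so the only remaining obligation sits at the leaf.
theorem target_double (n : Nat) : (n + 0) + 0 = n := by
  have h := leaf_add_zero (n + 0)   -- h : n + 0 + 0 = n + 0
  exact h.trans (leaf_add_zero n)
```

In a skeleton of this shape the file type-checks with a warning for each remaining `sorry`, so the set of open obligations can be read off mechanically.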

Key takeaway

AI architects designing systems for long-horizon formal verification should prioritize multi-agent harness designs over monolithic approaches. LeanMarathon demonstrates that decomposing a complex task into contract-scoped agents with deterministic CI gates guards against goal drift, context rot, and compute exhaustion, enabling reliable formalization of entire research papers. Use external verification and bounded agent scopes to keep such systems coherent and recoverable across extended operations.
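As a rough illustration of what a deterministic gate could look like, the Python sketch below rejects a submission if the theorem statement has changed since it was frozen or if the proof still contains `sorry`. This is an assumption-laden toy, not the paper's CI: the names `statement_fingerprint` and `ci_gate` are hypothetical, and a real gate would compile the file with Lean rather than inspect text.

```python
import hashlib
import re


def statement_fingerprint(statement: str) -> str:
    """Hash a theorem statement after collapsing whitespace (hypothetical helper)."""
    normalized = " ".join(statement.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def ci_gate(frozen_fingerprint: str, submitted_statement: str, proof_body: str) -> bool:
    """Accept only if the statement is unchanged and no `sorry` remains.

    Text-level checks only; an assumption standing in for a real Lean build.
    """
    if statement_fingerprint(submitted_statement) != frozen_fingerprint:
        return False  # statement drift: the target was edited
    if re.search(r"\bsorry\b", proof_body):
        return False  # proof is incomplete
    return True
```

Because the check is a pure function of the frozen fingerprint and the submitted text, the same submission always receives the same verdict, which is the property a gate needs in order to contain faults from any single agent.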

Key insights

Long-horizon autoformalization requires durable multi-agent harnesses with fault containment to ensure reliability and prevent drift.

Principles

Method

The harness pairs an evolving blueprint with a two-stage orchestrator. Stage one stabilizes the targets through adversarial review; stage two discharges the proof DAG from its dynamic leaves upward in parallel, CI-gated rounds.
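The leaf-upward discharge can be sketched in a few lines of Python. This is an interpretation, not the paper's implementation: it assumes a "dynamic leaf" is any unproved node whose dependencies are all already proved, and `discharge_rounds`, `deps`, and `ci_gate` are hypothetical names. The leaves of a round are independent of one another, so a real system could attempt them in parallel; the sketch runs them sequentially.

```python
from typing import Callable, Dict, List, Set


def discharge_rounds(
    deps: Dict[str, Set[str]],
    ci_gate: Callable[[str], bool],
    max_rounds: int = 100,
) -> List[List[str]]:
    """Return the nodes accepted in each round, working from the leaves upward.

    deps maps each node to the set of nodes it depends on.
    ci_gate(node) stands in for "a worker proved it and CI accepted it".
    """
    proved: Set[str] = set()
    rounds: List[List[str]] = []
    for _ in range(max_rounds):
        # Current leaves: unproved nodes with every dependency already proved.
        leaves = sorted(
            node for node, needs in deps.items()
            if node not in proved and needs <= proved
        )
        accepted = [node for node in leaves if ci_gate(node)]
        if not accepted:
            break  # finished, or no leaf passed the gate this round
        proved.update(accepted)
        rounds.append(accepted)
    return rounds


if __name__ == "__main__":
    dag = {"target": {"a", "b"}, "a": {"c"}, "b": set(), "c": set()}
    print(discharge_rounds(dag, lambda node: True))
```

A node that fails the gate blocks everything above it while leaving already accepted nodes untouched, which is one way to read the fault-containment claim.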

Best for: AI Scientist, Research Scientist, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.