An AI model programmed nonstop for 19 days on a single MirrorCode task that cost $2,600 to run

2026-06-26 · Source: The Decoder · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, quick

Summary

Epoch AI has introduced MirrorCode, a benchmark that tests whether AI models can reconstruct entire software programs from functional descriptions alone, without access to the original source code. In the initial results, Claude Opus 4.7 is the top performer with a 56 percent solve rate, and it rebuilt a 16,000-line toolkit in 14 hours. The benchmark also exposes a clear limit: every tested model consistently fails on the most complex programming tasks. In one case, an AI model ran nonstop for 19 days on a single MirrorCode task at a cost of $2,600, which shows how compute-intensive these attempts can become.

Key takeaway

AI engineers building code generation tools can use the MirrorCode benchmark to assess how well their models recreate whole programs. Claude Opus 4.7 performs strongly on moderately sized projects, but expect current models to struggle with highly complex programming challenges. Focus development effort on robustness for intricate tasks, and factor in substantial compute costs, such as the $2,600 spent on a single 19-day run, when planning advanced code recreation projects.

Key insights

The MirrorCode benchmark reveals AI's current limits in complex code recreation despite impressive partial successes.

Principles

AI excels at recreating moderately complex codebases.
Significant challenges remain for highly complex programming tasks.
Code recreation benchmarks quantify AI programming limits.

Method

The MirrorCode benchmark evaluates AI by requiring models to recreate complete programs from functional descriptions, without original source code access.

In practice

Consider Claude Opus 4.7, the top performer in the initial results, for code recreation tasks.
Expect failures on highly complex code generation.
Budget for significant compute costs on complex tasks.
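
To make the budgeting point concrete, the sketch below turns the reported figures (19 days, $2,600) into average daily and hourly rates. It assumes the cost accrued evenly over the run, which the source does not state. The function names and the $1,000 example budget are hypothetical and chosen only for illustration.

```python
# Back-of-the-envelope rates for the 19-day, $2,600 run quoted in the summary.
# Assumption (not from the source): cost accrued evenly over the whole run.

HOURS_PER_DAY = 24


def average_rates(total_cost_usd: float, run_days: float) -> dict[str, float]:
    """Return the average cost per day and per hour of a run."""
    if run_days <= 0:
        raise ValueError("run_days must be positive")
    per_day = total_cost_usd / run_days
    return {"per_day": per_day, "per_hour": per_day / HOURS_PER_DAY}


def days_within_budget(budget_usd: float, per_day_usd: float) -> float:
    """Days of runtime a budget covers at a fixed average daily rate."""
    if per_day_usd <= 0:
        raise ValueError("per_day_usd must be positive")
    return budget_usd / per_day_usd


# Figures reported in the summary.
rates = average_rates(total_cost_usd=2600, run_days=19)
print(f"average per day:  ${rates['per_day']:.2f}")
print(f"average per hour: ${rates['per_hour']:.2f}")

# Hypothetical budget, used only to illustrate the calculation.
example_budget_usd = 1000
covered_days = days_within_budget(example_budget_usd, rates["per_day"])
print(f"${example_budget_usd} covers about {covered_days:.1f} days at that rate")
```

Under the even-spend assumption, the run averages roughly $137 per day, or about $5.70 per hour. Actual spend on a long agentic run may vary from day to day, so treat these as planning estimates rather than measured rates.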

Topics

MirrorCode Benchmark
AI Code Generation
Claude Opus 4.7
Program Reconstruction
AI Model Evaluation
Computational Costs

Best for: Research Scientist, AI Scientist, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Decoder.