84.0% on ARC-AGI2 (840/1000) using LLM program synthesis + deterministic verification — no fine-tuning, no neural search

· Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

A student in Kyoto, Japan, scored 84.0% (840 of 1000 tasks) on the ARC-AGI2 training set by combining 127,000 lines of hand-crafted symbolic solvers with a Claude-powered program synthesis pipeline. The system runs in two stages. First, a set of 30+ specialized Python solvers attempts every task; this stage plateaued at 24.4% (244/1000 tasks). Second, for each task the solvers leave unsolved, Claude Sonnet 4.5 generates a Python `transform` function, which an external Python script then deterministically verifies against all of the task's training examples. Claude Opus 4 orchestrates the process, batching tasks and managing parallel Sonnet sub-agents. The approach uses no fine-tuning and no neural search. The synthesis stage solved 78.8% of the previously unsolved tasks, which is consistent with the reported totals (596 of the remaining 756), and the full pipeline processes all 1000 tasks in approximately 3 hours on a MacBook.
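The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not the author's code: `solve_task`, `passes_all`, and the stubbed `fake_llm` generator are hypothetical names, and the real system calls Claude models where the stub simply yields hard-coded functions.

```python
from typing import Callable, Iterable, Optional

Grid = list[list[int]]
Example = tuple[Grid, Grid]  # (input grid, expected output grid)
Transform = Callable[[Grid], Grid]


def passes_all(fn: Transform, examples: list[Example]) -> bool:
    """True only if fn reproduces every expected output exactly."""
    try:
        return all(fn(inp) == out for inp, out in examples)
    except Exception:
        return False  # a crashing candidate is simply rejected


def solve_task(
    examples: list[Example],
    symbolic_solvers: Iterable[Transform],
    propose_candidates: Callable[[list[Example]], Iterable[Transform]],
) -> tuple[Optional[Transform], str]:
    # Stage 1: try the hand-written symbolic solvers first.
    for solver in symbolic_solvers:
        if passes_all(solver, examples):
            return solver, "symbolic"
    # Stage 2: fall back to proposed programs (an LLM in the article,
    # a stub here), keeping only a candidate that verifies.
    for candidate in propose_candidates(examples):
        if passes_all(candidate, examples):
            return candidate, "synthesis"
    return None, "unsolved"


# Toy stand-ins for solvers and for LLM output (assumptions for illustration).
def flip_horizontal(grid: Grid) -> Grid:
    return [row[::-1] for row in grid]


def transpose(grid: Grid) -> Grid:
    return [list(col) for col in zip(*grid)]


def identity(grid: Grid) -> Grid:
    return [row[:] for row in grid]


def fake_llm(examples: list[Example]) -> Iterable[Transform]:
    yield identity   # wrong guess, rejected by verification
    yield transpose  # correct guess, accepted


task = [([[1, 2], [3, 4]], [[1, 3], [2, 4]])]
fn, stage = solve_task(task, [flip_horizontal], fake_llm)
print(stage)
```

In this toy task the only symbolic solver fails, the first proposed candidate is rejected, and the second is accepted, so the task is attributed to the synthesis stage.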

Key takeaway

If you are an AI Scientist or Research Scientist building solutions for complex reasoning benchmarks like ARC-AGI2, consider a neurosymbolic architecture: use the LLM to synthesize programs, and check each program with deterministic verification rather than relying on direct model predictions. Verification filters out hallucinated solutions and can significantly improve performance on tasks that require precise, checkable outputs. The trade-off is a possible generalization gap on evaluation sets, since the reported score is on the training set.

Key insights

Adding LLM program synthesis with deterministic verification lifted the score from 24.4% with symbolic solvers alone to 84.0% on the same 1000 tasks.

Principles

Method

An LLM generates Python code for each unsolved task. The code is then executed and deterministically verified against all of the task's training examples, and it is accepted only if its output matches the expected grid pixel for pixel on every example.
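A minimal sketch of such a verification step is shown below, assuming candidates arrive as Python source strings that define a `transform` function (as the summary describes). The name `verify_candidate` is hypothetical, and running `exec` in-process is a simplification for illustration: the article mentions an external script, and a real setup would likely isolate untrusted code in a separate process with a timeout.

```python
import copy
from typing import Any

Grid = list[list[int]]


def verify_candidate(source: str, examples: list[tuple[Grid, Grid]]) -> bool:
    """Accept source only if its transform() matches every example exactly."""
    if not examples:
        return False  # nothing to verify against, so never accept
    namespace: dict[str, Any] = {}
    try:
        exec(source, namespace)  # defines transform() if the source is valid
        transform = namespace["transform"]
        for grid_in, grid_out in examples:
            # Copy the input so a mutating candidate cannot corrupt examples.
            result = transform(copy.deepcopy(grid_in))
            # Normalize rows to lists, then require an exact cell-wise match.
            if [list(row) for row in result] != grid_out:
                return False
    except Exception:
        # Syntax errors, a missing transform, crashes, and bad return
        # types all count as a failed verification.
        return False
    return True


examples = [
    ([[1, 2, 3]], [[3, 2, 1]]),
    ([[4, 5], [6, 7]], [[5, 4], [7, 6]]),
]
good = "def transform(grid):\n    return [row[::-1] for row in grid]\n"
partial = "def transform(grid):\n    return [[3, 2, 1]]\n"  # fits one example only
crash = "def transform(grid):\n    return grid[99]\n"

print(verify_candidate(good, examples))
print(verify_candidate(partial, examples))
print(verify_candidate(crash, examples))
```

The check is all-or-nothing by design: a candidate that fits only some examples is rejected just like one that crashes, which is what lets verification screen out plausible but wrong programs.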

Best for: AI Scientist, Research Scientist, AI Researcher, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.