84.0% on ARC-AGI2 (840/1000) using LLM program synthesis + deterministic verification — no fine-tuning, no neural search

2026-03-01 · Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

A student in Kyoto, Japan, scored 84.0% (840/1000 tasks) on the ARC-AGI2 training set by combining 127,000 lines of hand-crafted symbolic solvers with a Claude-powered program synthesis pipeline. The system runs in two stages. First, a set of 30+ specialized Python solvers handles what it can; on its own this stage plateaued at 24.4% (244/1000 tasks). Second, for each task still unsolved, Claude Sonnet 4.5 generates a Python `transform` function, which an external Python script then verifies deterministically against all of that task's training examples. Claude Opus 4 orchestrates the process, batching tasks and managing parallel Sonnet sub-agents. The hybrid approach uses no fine-tuning and no neural search, and it solves 78.8% of the tasks the symbolic solvers left unsolved. The full pipeline processes all 1000 tasks in approximately 3 hours on a MacBook.
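The three percentages in the summary are consistent with one another, which the short calculation below illustrates. It assumes (the source does not say so explicitly) that the 78.8% figure is measured over the tasks left after the symbolic stage and that every task solved symbolically stays solved in the final count; `stage_rates` is a hypothetical helper name.

```python
def stage_rates(total: int, symbolic_solved: int, final_solved: int) -> dict:
    """Derive per-stage rates from the counts reported in the summary."""
    remaining = total - symbolic_solved            # tasks handed to the LLM stage
    synth_solved = final_solved - symbolic_solved  # tasks the LLM stage added
    return {
        "symbolic_rate": symbolic_solved / total,
        "final_rate": final_solved / total,
        "remaining": remaining,
        "synth_solved": synth_solved,
        "synth_rate": synth_solved / remaining,
    }


rates = stage_rates(total=1000, symbolic_solved=244, final_solved=840)
print(f"symbolic stage: {rates['symbolic_rate']:.1%}")
print(f"LLM stage: {rates['synth_solved']}/{rates['remaining']} "
      f"= {rates['synth_rate']:.1%}")
print(f"overall: {rates['final_rate']:.1%}")
```

Under those assumptions the LLM stage accounts for 596 of the 756 remaining tasks, which rounds to the reported 78.8%.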

Key takeaway

AI Scientists and Research Scientists working on complex reasoning benchmarks like ARC-AGI2 should consider a neurosymbolic architecture. Use LLMs for program synthesis rather than direct prediction, and pair them with strict, deterministic verification so that hallucinated programs are rejected before they count as solutions. This approach can significantly improve performance on tasks requiring precise, verifiable outputs, though programs verified only against training examples may still show a generalization gap on evaluation sets.

Key insights

Combining LLM program synthesis with deterministic verification raised the score from 24.4% (symbolic solvers alone) to 84.0% on the ARC-AGI2 training set.

Principles

Deterministic verification catches LLM hallucinations.
LLMs excel at code generation, not direct grid prediction.
Hybrid neurosymbolic systems overcome plateaus.

Method

For each unsolved task, an LLM generates a Python `transform` function. The function is executed and deterministically verified against all of the task's training examples, and it is accepted only if its output is pixel-perfect on every one of them.
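A minimal sketch of this accept-or-reject check is shown below. It is not the project's verifier: the names `load_transform` and `verify` are hypothetical, grids are assumed to be lists of lists of integers, and `exec` is used without any sandboxing or timeout, which a real pipeline running untrusted generated code would likely need.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]  # assumption: an ARC grid as a list of rows of ints


def load_transform(source: str) -> Callable[[Grid], Grid]:
    """Execute candidate source and return the `transform` it defines."""
    namespace: dict = {}
    exec(source, namespace)  # NOTE: no sandboxing; illustration only
    fn = namespace.get("transform")
    if not callable(fn):
        raise ValueError("candidate does not define transform()")
    return fn


def verify(source: str, train_pairs: List[Tuple[Grid, Grid]]) -> bool:
    """Accept a candidate only if it reproduces every training output exactly."""
    try:
        transform = load_transform(source)
        for grid_in, expected in train_pairs:
            # Pass a copy so a candidate cannot mutate the stored example.
            result = transform([row[:] for row in grid_in])
            actual = [list(row) for row in result]
            if actual != expected:
                return False  # a single differing cell rejects the candidate
    except Exception:
        return False  # crashes and malformed outputs count as failures
    return True


train_pairs = [
    ([[1, 2], [3, 4]], [[2, 1], [4, 3]]),
    ([[5, 0, 0]], [[0, 0, 5]]),
]
mirror = "def transform(grid):\n    return [row[::-1] for row in grid]\n"
identity = "def transform(grid):\n    return grid\n"

print(verify(mirror, train_pairs))    # True
print(verify(identity, train_pairs))  # False
```

The check is all-or-nothing by design: a candidate that matches most cells, or most examples, is rejected exactly like one that crashes, so nothing the LLM produces is accepted on plausibility alone.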

In practice

Use LLMs as code generators, not direct solvers.
Implement strict, deterministic verification for LLM outputs.
Combine symbolic solvers with LLM synthesis for complex problems.
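The three practices above can be combined into a single control loop, sketched here under stated assumptions. `solve_task`, `fits_all`, and the `propose` callback are hypothetical names; `propose` stands in for the LLM call and yields already-loaded candidate functions, whereas a real pipeline would receive source code from the model and load it first. The attempt limit is likewise an illustrative parameter, not a figure from the source.

```python
from itertools import islice
from typing import Callable, Iterable, List, Optional, Tuple

Grid = List[List[int]]
Solver = Callable[[Grid], Grid]
Pairs = List[Tuple[Grid, Grid]]


def fits_all(fn: Solver, pairs: Pairs) -> bool:
    """True only if fn reproduces every training output exactly."""
    try:
        return all(fn([row[:] for row in g_in]) == g_out for g_in, g_out in pairs)
    except Exception:
        return False


def solve_task(
    pairs: Pairs,
    symbolic_solvers: List[Solver],
    propose: Callable[[Pairs], Iterable[Solver]],
    max_attempts: int = 3,
) -> Optional[Tuple[str, Solver]]:
    # Stage 1: hand-written symbolic solvers.
    for solver in symbolic_solvers:
        if fits_all(solver, pairs):
            return "symbolic", solver
    # Stage 2: generated candidates, each gated by the same strict check.
    for candidate in islice(propose(pairs), max_attempts):
        if fits_all(candidate, pairs):
            return "synthesis", candidate
    return None  # unsolved: nothing is accepted without passing verification


def identity(grid: Grid) -> Grid:
    return grid


def mirror(grid: Grid) -> Grid:
    return [row[::-1] for row in grid]


def transpose(grid: Grid) -> Grid:
    return [list(col) for col in zip(*grid)]


def stub_propose(pairs: Pairs) -> Iterable[Solver]:
    """Stand-in for the LLM: always proposes a transpose."""
    yield transpose


solvers = [identity, mirror]
mirror_task = [([[1, 2]], [[2, 1]])]
transpose_task = [([[1, 2], [3, 4]], [[1, 3], [2, 4]])]

print(solve_task(mirror_task, solvers, stub_propose)[0])     # symbolic
print(solve_task(transpose_task, solvers, stub_propose)[0])  # synthesis
```

Running the symbolic solvers first means the LLM is only called on tasks they could not handle, and both stages answer to the same verification function.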

Topics

ARC-AGI2
LLM Program Synthesis
Deterministic Verification
Neurosymbolic AI
Claude Sonnet

Code references

Ag3497120/verantyx-v6

Best for: AI Scientist, Research Scientist, AI Researcher, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.