Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2
Summary
A new solver for the ARC-AGI-2 few-shot visual reasoning benchmark, named "Modality-Driven Search with Holistic Trace Judging," has achieved the highest verified score on the benchmark's leaderboard. It reached 72.9 percent accuracy on the semi-private evaluation set at a cost of USD 38.99 per task, outperforming the leading standalone frontier models, GPT-5.2 Pro (54.2 percent) and Gemini 3 Pro (54.0 percent), by at least 18.7 percentage points. On the public evaluation set, it scored 76.1 percent at USD 19.69 per task. The system rests on two core principles: using reasoning modalities (text, image, code) as search operators to generate diverse candidate solutions, and employing a context-preserving holistic judge model that jointly compares all candidates within a single long-context prompt. This lets the system identify correct hypotheses held by only a minority of candidates, a case that traditional self-consistency or majority voting handles poorly. The research also reports that prescriptive prompting templates and iterative refinement reduce hypothesis diversity and hurt overall performance.
Key takeaway
AI scientists and machine learning engineers developing LLM-based solvers for abstract reasoning tasks, especially visual reasoning, should consider implementing modality-driven search with holistic trace judging. This approach generates diverse candidates across text, image, and code and then evaluates them jointly. It outperforms standalone frontier models and traditional voting methods because it can identify correct minority hypotheses. Prioritize hypothesis diversity and avoid overly prescriptive prompting templates to maximize performance.
Key insights
Diverse modality-driven candidate generation combined with holistic trace judging improves LLM performance on abstract reasoning by recovering minority correct answers.
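To make the "minority correct answer" problem concrete, the toy example below shows how plain majority voting discards a correct answer that only one sample produced. The grids and the choice of which one is "correct" are invented for illustration and are not taken from the original article.

```python
from collections import Counter
from typing import List, Tuple

# Grids are tuples of tuples so they are hashable and can be counted.
Grid = Tuple[Tuple[int, ...], ...]


def majority_vote(answers: List[Grid]) -> Grid:
    """Return the most frequent answer, ignoring any reasoning behind it."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical sampled answers: three samples agree on a wrong grid,
# and a single sample produces the grid we assume to be correct.
answers: List[Grid] = [
    ((0, 1), (1, 0)),
    ((0, 1), (1, 0)),
    ((0, 1), (1, 0)),
    ((1, 0), (0, 1)),
]
correct: Grid = ((1, 0), (0, 1))  # assumed ground truth for this toy case

voted = majority_vote(answers)
print(voted == correct)    # False: the vote follows the wrong majority
print(correct in answers)  # True: the right answer was generated, then lost
```

Because voting looks only at answer frequency, the correct candidate is still present in the pool but never selected. A judge that reads every candidate's trace at least has the opportunity to pick it.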
Principles
- Treat reasoning modalities as search operators.
- Judge all candidate traces holistically.
- Avoid prescriptive prompting templates.
Method
The solver generates diverse reasoning candidates across text, image, and code modalities. A judge model then jointly compares all these candidates within a single long-context prompt to select the best solution.
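The two stages described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Candidate` class, the generator and judge callables, the prompt wording, and all function names are hypothetical, and the model calls are left as pluggable functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Grid = List[List[int]]


@dataclass
class Candidate:
    modality: str  # "text", "image", or "code"
    trace: str     # the reasoning trace that led to the answer
    answer: Grid   # the predicted output grid


# Hypothetical interfaces: a generator maps a task description to one
# candidate; a judge maps a prompt to the index of the chosen candidate.
Generator = Callable[[str], Candidate]
Judge = Callable[[str], int]


def generate_candidates(
    task: str, generators: Dict[str, Generator], samples_per_modality: int = 1
) -> List[Candidate]:
    """Use each modality as a search operator to widen the candidate pool."""
    candidates: List[Candidate] = []
    for generator in generators.values():
        for _ in range(samples_per_modality):
            candidates.append(generator(task))
    return candidates


def build_judge_prompt(task: str, candidates: List[Candidate]) -> str:
    """Place every candidate trace in one prompt so they are compared jointly."""
    parts = [
        f"TASK:\n{task}",
        "Compare ALL candidates below and reply with the index of the best one.",
    ]
    for i, c in enumerate(candidates):
        parts.append(
            f"--- Candidate {i} (modality: {c.modality}) ---\n"
            f"Trace: {c.trace}\nAnswer: {c.answer}"
        )
    return "\n".join(parts)


def solve(
    task: str,
    generators: Dict[str, Generator],
    judge: Judge,
    samples_per_modality: int = 1,
) -> Grid:
    """Generate diverse candidates, then let a single judge call select one."""
    candidates = generate_candidates(task, generators, samples_per_modality)
    index = judge(build_judge_prompt(task, candidates))
    return candidates[index].answer
```

In this sketch the judge sees full traces rather than only final answers, which is what distinguishes joint comparison from counting votes. In a real system the generators and judge would wrap model calls; here they can be replaced by simple stub functions for testing.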
In practice
- Generate candidates across text, image, and code.
- Employ a long-context judge for joint comparison.
- Prioritize hypothesis diversity over strict templates.
Topics
- ARC-AGI-2
- Large Language Models
- Visual Reasoning
- Modality-Driven Search
- Holistic Trace Judging
- Prompt Engineering
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.