Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2

2026-06-30 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new solver for the ARC-AGI-2 few-shot visual reasoning benchmark, "Modality-Driven Search with Holistic Trace Judging," has achieved the highest verified score on the benchmark's leaderboard. It reached 72.9 percent accuracy on the semi-private evaluation set at USD 38.99 per task, 18.7 percentage points ahead of GPT-5.2 Pro (54.2 percent) and 18.9 points ahead of Gemini 3 Pro (54.0 percent), two leading standalone frontier models. On the public evaluation set, it scored 76.1 percent at USD 19.69 per task. The system rests on two core principles: using reasoning modalities (text, image, code) as search operators to generate diverse candidate solutions, and employing a context-preserving holistic judge model that compares all candidates jointly within a single long-context prompt. This joint comparison can select a correct hypothesis even when only a minority of candidates propose it, a case where self-consistency and majority voting fail. The research also reports that prescriptive prompting templates and iterative refinement reduce hypothesis diversity and hurt overall performance.

Key takeaway

AI scientists and machine learning engineers building LLM-based solvers for abstract reasoning tasks, especially visual reasoning, should consider implementing modality-driven search with holistic trace judging. The approach generates diverse candidates across text, image, and code and then evaluates them jointly. On ARC-AGI-2 it outperformed standalone frontier models, and it can recover correct minority hypotheses that voting methods discard. Prioritize hypothesis diversity and avoid overly prescriptive prompting templates to maximize performance.

Key insights

Diverse modality-driven candidate generation combined with holistic trace judging improves LLM performance on abstract reasoning by recovering correct answers that only a minority of candidates propose.
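
To make the minority-answer problem concrete, here is a minimal sketch of plain majority voting on a toy case. The grids and the candidate counts are hypothetical and are not taken from the original article; the point is only that a vote over final answers cannot pick a hypothesis that fewer candidates support, however sound its reasoning.

```python
from collections import Counter

# A grid is stored as nested tuples so it is hashable and can be counted.
Grid = tuple[tuple[int, ...], ...]


def majority_vote(candidates: list[Grid]) -> Grid:
    """Return the most frequent candidate grid (plain self-consistency)."""
    return Counter(candidates).most_common(1)[0][0]


# Hypothetical toy task: three candidates share a wrong rule, one is right.
wrong: Grid = ((1, 0), (0, 1))
right: Grid = ((0, 1), (1, 0))
candidates = [wrong, wrong, wrong, right]

picked = majority_vote(candidates)
print(picked == right)  # False: the vote discards the minority answer
```

A judge that reads the reasoning traces behind each answer has information the vote never sees, which is the gap holistic trace judging is meant to close.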

Principles

Treat reasoning modalities as search operators.
Judge all candidate traces holistically.
Avoid prescriptive prompting templates.

Method

The solver generates diverse reasoning candidates across text, image, and code modalities. A judge model then jointly compares all these candidates within a single long-context prompt to select the best solution.
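
The two stages described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the Candidate fields, the prompt wording, and the generator and judge callables are all hypothetical stand-ins for real model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    modality: str  # "text", "image", or "code"
    trace: str     # the full reasoning trace, kept unabridged
    answer: str    # the proposed output grid, serialized as text


# Hypothetical signatures: a generator maps a task to one candidate,
# and a judge maps one prompt to the 1-based index of its chosen candidate.
Generator = Callable[[str], Candidate]
Judge = Callable[[str], int]


def build_judge_prompt(task: str, candidates: list[Candidate]) -> str:
    """Place every candidate trace in a single prompt for joint comparison."""
    parts = [f"Task:\n{task}\n"]
    for i, c in enumerate(candidates, start=1):
        parts.append(
            f"Candidate {i} ({c.modality})\nTrace:\n{c.trace}\nAnswer:\n{c.answer}\n"
        )
    parts.append("Compare all candidates and reply with the number of the best one.")
    return "\n".join(parts)


def solve(task: str, generators: list[Generator], judge: Judge) -> Candidate:
    """Generate one candidate per modality operator, then let the judge pick."""
    candidates = [g(task) for g in generators]
    choice = judge(build_judge_prompt(task, candidates))
    return candidates[choice - 1]


# Stub generators and judge standing in for real model calls.
stubs: list[Generator] = [
    lambda t: Candidate("text", "rows look mirrored", "[[1,0],[0,1]]"),
    lambda t: Candidate("image", "shape appears rotated", "[[1,0],[0,1]]"),
    lambda t: Candidate("code", "transpose() passes all train pairs", "[[0,1],[1,0]]"),
]
best = solve("toy task", stubs, judge=lambda prompt: 3)
print(best.modality, best.answer)
```

Because the judge receives the traces and not just the final answers, it can in principle prefer the single code-based candidate here even though two other candidates agree with each other.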

In practice

Generate candidates across text, image, and code.
Employ a long-context judge for joint comparison.
Prioritize hypothesis diversity over strict templates.
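
The summary does not say how hypothesis diversity is measured. One simple proxy, offered here purely as an assumption, is the fraction of distinct answers among the candidates. The answer labels below are hypothetical, not results from the original article.

```python
def diversity(answers: list[str]) -> float:
    """Fraction of distinct answers among the candidates (higher = more varied)."""
    if not answers:
        return 0.0
    return len(set(answers)) / len(answers)


# Hypothetical candidate answers from two prompting styles.
templated = ["A", "A", "A", "A"]   # a strict template funnels every run to one rule
open_ended = ["A", "B", "A", "C"]  # looser prompts explore more hypotheses

print(diversity(templated))   # 0.25
print(diversity(open_ended))  # 0.75
```

Tracking a number like this while changing prompts gives a quick check on whether a template is collapsing the candidate pool before the judge ever sees it.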

Topics

ARC-AGI-2
Large Language Models
Visual Reasoning
Modality-Driven Search
Holistic Trace Judging
Prompt Engineering

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.