New math benchmark reveals AI models confidently solve problems that have no solution

2026-05-17 · Source: The Decoder · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, medium

Summary

A consortium of 64 mathematicians, including professors, PhD students, and IMO medalists, developed SOOHAK, a new benchmark for AI models comprising 439 original math tasks. Created at Carnegie Mellon University, EleutherAI, and Seoul National University, the benchmark aims to expose two weaknesses: poor performance on research-level math and the failure to recognize unsolvable problems. The dataset is split into a "Challenge" set of 340 graduate and research-level problems and a "Refusal" set of 99 intentionally flawed tasks. Initial tests show Google's Gemini 3 Pro scoring highest on the Challenge set at 30 percent, followed by GPT-5 (5.1, 5.2) at 26 percent. On the Refusal set, no model cleared 50 percent, with the open-weight GLM-5 performing best. The full dataset will remain private until late 2026 to prevent training data contamination.
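
A minimal sketch of why the two subsets are reported separately rather than as one number. The summary does not describe SOOHAK's actual grading, so the `Result` record, the subset labels, and `score_by_subset` are hypothetical names used only for illustration, and the toy data is invented.

```python
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical record format; not taken from the SOOHAK release.
    subset: str    # "challenge" or "refusal"
    correct: bool  # challenge: answer judged right; refusal: flaw was flagged


def score_by_subset(results):
    """Return the fraction of correct results for each subset."""
    totals, hits = {}, {}
    for r in results:
        totals[r.subset] = totals.get(r.subset, 0) + 1
        hits[r.subset] = hits.get(r.subset, 0) + int(r.correct)
    return {subset: hits[subset] / totals[subset] for subset in totals}


# Subset sizes as stated in the summary: 340 + 99 = 439 tasks.
CHALLENGE_SIZE, REFUSAL_SIZE = 340, 99
TOTAL_SIZE = CHALLENGE_SIZE + REFUSAL_SIZE

# Invented toy results, only to exercise the function.
toy = [
    Result("challenge", True),
    Result("challenge", False),
    Result("challenge", False),
    Result("challenge", False),
    Result("refusal", True),
    Result("refusal", False),
]
scores = score_by_subset(toy)
print(scores)
```

Keeping the scores apart matters because a single pooled accuracy would be dominated by the larger Challenge set and could hide a weak refusal rate.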

Key takeaway

AI scientists developing advanced mathematical reasoning models should recognize that current scaling methods primarily improve problem-solving, not the critical ability to identify unsolvable problems. Development efforts should explicitly target "refusal" as a distinct optimization goal, potentially by incorporating training on flawed or contradictory problem sets, to build more robust and reliable mathematical AI systems.

Key insights

Current AI models struggle with research-level math and fail to recognize unsolvable problems, often guessing confidently.

Principles

Scaling compute boosts solution rates, not refusal rates.
Olympiad-style training transfers better than broad research depth.

Method

Every SOOHAK problem was written from scratch by human mathematicians and passed through submission, automated LLM checks, manual moderation, and revisions before final inclusion.

In practice

Prioritize refusal capabilities in next-gen AI math models.
Focus training on competitive math formats for better performance.

Topics

SOOHAK Benchmark
Research-Level Math
Unsolvable Problems
AI Model Refusal
Mathematical Reasoning

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Decoder.