New math benchmark reveals AI models confidently solve problems that have no solution

Source: The Decoder · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences) · Depth: Expert, medium

Summary

A consortium of 64 mathematicians, including professors, PhD students, and IMO medalists, developed SOOHAK, a new benchmark for AI models comprising 439 original math tasks. The benchmark, created at Carnegie Mellon University, EleutherAI, and Seoul National University, aims to expose two weaknesses: poor performance on research-level math and failure to recognize unsolvable problems. The dataset is split into a "Challenge" set of 340 graduate and research-level problems and a "Refusal" set of 99 intentionally flawed tasks. In initial tests, Google's Gemini 3 Pro scored highest on the Challenge set at 30 percent, followed by GPT-5 (5.1, 5.2) at 26 percent. On the Refusal set, no model cleared 50 percent; the open-weight GLM-5 performed best. The full dataset will remain private until late 2026 to prevent training data contamination.

Key takeaway

AI scientists developing advanced mathematical reasoning models should recognize that current scaling methods primarily improve problem-solving, not the critical ability to identify unsolvable problems. Development efforts should explicitly target "refusal" as a distinct optimization goal, potentially by training on flawed or contradictory problem sets, to build more robust and reliable mathematical AI systems.
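Treating refusal as its own goal starts with measuring it separately from problem-solving accuracy. The following is a minimal sketch of that idea, not SOOHAK's actual grading code: the `Item` record, the `score` function, and the `UNSOLVABLE` marker are hypothetical names, and it assumes responses have already been normalized to a final answer string.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical convention: a model signals refusal with this exact string.
REFUSAL_MARKER = "UNSOLVABLE"


@dataclass
class Item:
    subset: str                # "challenge" or "refusal"
    reference: Optional[str]   # expected answer; None for flawed tasks
    response: str              # model output, assumed already normalized


def score(items):
    """Return per-subset accuracy so the two abilities are never averaged together."""
    totals = {"challenge": 0, "refusal": 0}
    correct = {"challenge": 0, "refusal": 0}
    for item in items:
        totals[item.subset] += 1
        if item.subset == "refusal":
            # A flawed task counts as correct only if the model declines to answer.
            ok = item.response == REFUSAL_MARKER
        else:
            # A solvable task counts as correct only on an exact answer match.
            ok = item.response == item.reference
        correct[item.subset] += int(ok)
    return {
        name: (correct[name] / totals[name] if totals[name] else 0.0)
        for name in totals
    }
```

Reporting the two rates side by side keeps a strong Challenge score from hiding a weak Refusal score, which is the gap the summary above describes.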

Key insights

Current AI models struggle with research-level math and fail to recognize unsolvable problems, often guessing confidently.

Principles

Method

Each SOOHAK problem went through submission, automated LLM checks, manual moderation, revision, and final inclusion. All problems were written from scratch by human mathematicians.

Topics

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Decoder.