New math benchmark reveals AI models confidently solve problems that have no solution
Summary
A consortium of 64 mathematicians, including professors, PhD students, and IMO medalists, developed SOOHAK, a new benchmark for AI models comprising 439 original math tasks. Created at Carnegie Mellon University, EleutherAI, and Seoul National University, the benchmark is designed to expose two weaknesses: poor performance on research-level math and a failure to recognize unsolvable problems. The dataset is split into a "Challenge" set of 340 graduate and research-level problems and a "Refusal" set of 99 intentionally flawed tasks. In initial tests, Google's Gemini 3 Pro scored highest on the Challenge set at 30 percent, followed by GPT-5 (5.1, 5.2) at 26 percent. On the Refusal set, no model cleared 50 percent, and the open-weight GLM-5 performed best. The full dataset will remain private until late 2026 to prevent training data contamination.
Key takeaway
AI scientists developing advanced mathematical reasoning models should recognize that current scaling methods mainly improve problem-solving, not the critical ability to identify unsolvable problems. Treat "refusal" as a distinct optimization goal, for example by training on flawed or contradictory problem sets, to build more robust and reliable mathematical AI systems.
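As a rough illustration of what tracking refusal as its own metric could look like, here is a minimal Python sketch that scores the two splits separately instead of folding them into one accuracy number. The `Result` record, the `REFUSAL` marker string, the split labels, and the exact-match grading are all assumptions made for this example; they are not SOOHAK's actual evaluation harness.

```python
from dataclasses import dataclass

# Hypothetical marker a model emits when it judges a task unsolvable.
REFUSAL = "UNSOLVABLE"


@dataclass
class Result:
    split: str     # "challenge" or "refusal" (assumed labels)
    expected: str  # reference answer; unused for the refusal split
    answer: str    # what the model returned


def score(results):
    """Return one rate per split so refusal is never averaged into accuracy."""
    totals = {"challenge": 0, "refusal": 0}
    hits = {"challenge": 0, "refusal": 0}
    for r in results:
        totals[r.split] += 1
        if r.split == "refusal":
            # Credit only an explicit refusal; a confident answer is a miss.
            hits[r.split] += r.answer.strip() == REFUSAL
        else:
            # Simplified exact-match grading for solvable problems.
            hits[r.split] += r.answer.strip() == r.expected.strip()
    return {s: (hits[s] / totals[s] if totals[s] else 0.0) for s in totals}


demo = [
    Result("challenge", "42", "42"),
    Result("challenge", "7", "8"),
    Result("refusal", "", "3"),
    Result("refusal", "", REFUSAL),
]
rates = score(demo)
print(rates)
```

Keeping the two rates apart makes it visible when a change lifts the solution rate while leaving the refusal rate flat.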
Key insights
Current AI models struggle with research-level math and often fail to recognize unsolvable problems, guessing confidently instead.
Principles
- Scaling compute boosts solution rates, not refusal rates.
- Olympiad-style training transfers better than broad research depth.
Method
Every SOOHAK problem was written from scratch by human mathematicians and went through submission, automated LLM checks, manual moderation, and revisions before final inclusion.
In practice
- Prioritize refusal capabilities in next-gen AI math models.
- Focus training on competitive math formats for better performance.
Topics
- SOOHAK Benchmark
- Research-Level Math
- Unsolvable Problems
- AI Model Refusal
- Mathematical Reasoning
Best for: AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Decoder.