Scientists built the hardest AI test ever and the results are surprising

2026-03-13 · Source: Artificial Intelligence News -- ScienceDaily · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, long

Summary

Researchers, including a professor from Texas A&M University, have developed "Humanity's Last Exam" (HLE), a benchmark designed to challenge advanced AI systems that now ace traditional tests such as MMLU. The 2,500-question assessment covers highly specialized topics across mathematics, the humanities, the natural sciences, and ancient languages. Nearly 1,000 experts worldwide contributed questions, each designed to have a clear, verifiable answer that cannot be found through a simple internet search. Crucially, any question that current AI models could solve was removed during development. Early results show that even powerful models struggle: GPT-4o scored 2.7%, Claude 3.5 Sonnet 4.1%, and OpenAI's o1 8%, while the most capable systems, such as Gemini 3.1 Pro and Claude Opus 4.6, reach only 40-50% accuracy. The exam aims to measure the gap between AI performance and expert-level human knowledge.

Key takeaway

For AI scientists and research scientists developing advanced models, "Humanity's Last Exam" highlights a significant gap in expert-level understanding. Integrating the benchmark into your evaluation process gives a more accurate measure of your models' capabilities beyond basic pattern recognition. It also helps pinpoint the areas where AI still falls short, guiding research toward safer, more reliable systems with deeper contextual intelligence.

Key insights

New benchmarks are essential to accurately measure AI capabilities beyond pattern recognition and identify gaps in expert-level understanding.

Principles

AI intelligence requires depth, context, and specialized expertise.
Benchmarks must evolve as AI capabilities advance.

Method

A global team of nearly 1,000 experts created 2,500 specialized questions. Questions solvable by leading AI models were systematically removed to ensure the test remained challenging.
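
The removal step described above can be sketched roughly as follows. This is a minimal illustration, not the HLE team's actual pipeline: the `Model` callable type, the `question`/`answer` dictionary keys, and the normalized exact-match check are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a "model" is any callable that maps a question
# string to an answer string. Real systems would wrap an API client here.
Model = Callable[[str], str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def filter_unsolved(
    questions: List[Dict[str, str]], models: List[Model]
) -> List[Dict[str, str]]:
    """Keep only the questions that none of the reference models answers correctly."""
    kept = []
    for q in questions:
        expected = normalize(q["answer"])
        # A question counts as "solved" if any single model matches the expected answer.
        solved = any(normalize(m(q["question"])) == expected for m in models)
        if not solved:
            kept.append(q)
    return kept
```

A filter like this keeps the remaining set hard for the models it was checked against, though not necessarily for models released later.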

In practice

Evaluate AI systems with HLE for true expert-level assessment.
Focus AI development on depth and specialized context.
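
One way to act on these points is to break benchmark accuracy down by subject, so weak areas stand out. The sketch below assumes each graded record carries hypothetical `subject`, `expected`, and `predicted` fields and uses simple case-insensitive string matching; it is not HLE's official grading procedure.

```python
from collections import defaultdict
from typing import Dict, Iterable


def accuracy_by_subject(records: Iterable[Dict[str, str]]) -> Dict[str, float]:
    """Return the fraction of correct answers per subject.

    Each record is assumed to have "subject", "expected", and "predicted" keys
    (hypothetical field names chosen for this example).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        totals[r["subject"]] += 1
        # Case-insensitive exact match after trimming surrounding whitespace.
        if r["predicted"].strip().lower() == r["expected"].strip().lower():
            correct[r["subject"]] += 1
    return {subject: correct[subject] / totals[subject] for subject in totals}
```

Free-form expert answers may need a more careful comparison than string matching, so treat the per-subject numbers from a sketch like this as a rough guide.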

Topics

AI Benchmarking
AI Capabilities
Large Language Models
Expert-level Knowledge
AI Limitations

Best for: AI Scientist, Research Scientist, AI Researcher, AI Engineer, Policy Maker

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence News -- ScienceDaily.