A classic brain test exposed AI's biggest weakness

2026-06-10 · Source: Artificial Intelligence News -- ScienceDaily · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Social Sciences & Behavioral Studies · Depth: Novice, short

Summary

New research led by Suketu Patel used a classic psychological experiment, the Stroop task, to expose a significant weakness in leading AI models, including GPT-4o, Claude 3.5 Sonnet, GPT-5, Claude Opus 4.1, and Gemini 2.5. These large language models performed well on short lists of five color words, but their accuracy fell sharply on longer sequences. For instance, GPT-4o's accuracy fell from 91% with five words to 15% with forty words, and Claude 3.5 Sonnet held steady up to twenty words before dropping to 24% on forty-word lists. The models struggled to stay focused on naming ink colors and instead defaulted to reading the words, a heavily trained response. This failure to consistently suppress distractions and maintain cognitive control over long, demanding tasks points to a fundamental limitation of current AI systems compared with human attention.

Key takeaway

Machine learning engineers building applications that require sustained focus should expect current LLMs such as GPT-4o and Claude 3.5 Sonnet to degrade significantly on longer, cognitively demanding tasks. These models may lose track of instructions and fall back on default behaviors when inputs are long or conflicting. Test explicitly for attention span and cognitive control, especially in critical applications where consistent focus is essential.

Key insights

Current LLMs struggle with sustained attention and cognitive control on demanding, longer-sequence tasks.

Principles

LLM performance degrades with task length.
AI struggles to suppress dominant responses.
Human attention differs from machine attention.

Method

The study applied the Stroop task, presenting color words in conflicting ink colors, to measure AI models' ability to identify ink color over word content across varying list lengths.
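
As a rough illustration of that setup, the sketch below builds an incongruent Stroop list and scores a set of answers against the ink colors. It is not the study's code: the four-color palette, the "WORD (ink: color)" text encoding, and the function names make_stroop_list, to_prompt, and score are all assumptions made for the example.

```python
import random

# Assumed palette; the source does not list the colors the study used.
COLORS = ["red", "green", "blue", "yellow"]

def make_stroop_list(n_items, seed=0):
    """Return n_items incongruent (word, ink) pairs: the word never names its ink."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_items):
        word = rng.choice(COLORS)
        ink = rng.choice([c for c in COLORS if c != word])
        trials.append((word, ink))
    return trials

def to_prompt(trials):
    """Encode the list as text. The 'WORD (ink: color)' format is an assumption."""
    items = "\n".join(
        f"{i + 1}. {word.upper()} (ink: {ink})"
        for i, (word, ink) in enumerate(trials)
    )
    return "Name the ink color of each item, not the word itself.\n" + items

def score(answers, trials):
    """Fraction of items answered with the ink color; missing answers count as wrong."""
    correct = sum(
        answer.strip().lower() == ink
        for answer, (_, ink) in zip(answers, trials)
    )
    return correct / len(trials)

trials = make_stroop_list(40)
print(to_prompt(trials[:3]))
# A responder that simply reads the words gets every incongruent item wrong.
print(score([word for word, _ in trials], trials))
```

Because every item is incongruent, a responder that falls back on reading the word scores zero, which makes this kind of lapse easy to see in the accuracy figure.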

In practice

Test AI models with extended, conflicting inputs.
Design tasks to minimize reliance on default responses.
Evaluate LLMs for sustained cognitive control.
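
One way to act on these points is to measure accuracy separately for each list length, so that a drop on longer inputs is visible rather than averaged away. The sketch below is a minimal harness under stated assumptions: accuracy_by_length and incongruent_trials are hypothetical names, the lengths 5, 20, and 40 echo the list sizes mentioned in the summary, and drifting_stub is a toy stand-in for a real model call whose cutoff of ten items is arbitrary and does not reproduce the study's results.

```python
import random

# Assumed palette; the source does not list the colors the study used.
COLORS = ["red", "green", "blue", "yellow"]

def incongruent_trials(n_items, rng):
    """Build n_items (word, ink) pairs in which word and ink always differ."""
    trials = []
    for _ in range(n_items):
        word, ink = rng.sample(COLORS, 2)  # two distinct colors
        trials.append((word, ink))
    return trials

def accuracy_by_length(model, lengths=(5, 20, 40), runs=20, seed=0):
    """Run `model` on several lists per length and return {length: accuracy}.

    `model` is any callable that takes the trials and returns one answer per item.
    """
    rng = random.Random(seed)
    results = {}
    for n in lengths:
        correct = 0
        for _ in range(runs):
            trials = incongruent_trials(n, rng)
            answers = model(trials)
            correct += sum(a == ink for a, (_, ink) in zip(answers, trials))
        results[n] = correct / (n * runs)
    return results

def drifting_stub(trials):
    """Toy stand-in, not a real LLM: names the ink for 10 items, then reads the word."""
    return [ink if i < 10 else word for i, (word, ink) in enumerate(trials)]

print(accuracy_by_length(drifting_stub))
```

To evaluate a real model, replace drifting_stub with a function that sends the list to the model and parses one color per item from its reply. That wiring depends on the API in use and is left out here.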

Topics

Large Language Models
Cognitive Control
Stroop Task
AI Attention
Model Performance Degradation
GPT-4o
Claude 3.5 Sonnet

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence News -- ScienceDaily.