Uneven Evolution of Cognition Across Generations of Generative AI Models

2026-05-11 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

The authors introduce a psychometric framework for evaluating the cognitive capabilities of generative AI models, comparing them against human norms and tracking how they change across model generations. Initial evaluations using tasks adapted from the Wechsler Adult Intelligence Scale (WAIS-IV) showed that leading multimodal models, including OpenAI's GPT-4 Turbo and GPT-4o, Google's Gemini 1.5 Flash and Gemini 1.5 Pro, and Anthropic's Claude 3 Opus and Claude 3.5 Sonnet, performed near the ceiling in verbal comprehension and working memory (>98th percentile) but near the floor in perceptual reasoning (<1st percentile). To track development beyond the limits of human-normed tests, the authors developed the Artificial Intelligence Quotient (AIQ) Benchmark and applied it to six generations of Gemini models and two model families, finding significant but asymmetric performance gains. Abstract quantitative reasoning improved much faster when problems were presented linguistically than when they were presented visually, which suggests an architectural bias toward language-based symbolic manipulation, while visual-perceptual organization remained largely stagnant.
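
To give a sense of what the reported percentiles mean on a familiar scale, the sketch below converts between percentiles and WAIS-style index scores. It assumes index scores follow a normal distribution with mean 100 and standard deviation 15, which is the standard WAIS-IV norming convention; the summary does not say how the paper itself derived its percentiles, and the function names are hypothetical.

```python
from statistics import NormalDist

# Assumption: index scores are normed to mean 100, SD 15 (WAIS-IV convention).
INDEX_NORM = NormalDist(mu=100, sigma=15)


def percentile_from_index(score: float) -> float:
    """Percentile rank (0-100) of an index score under the normal assumption."""
    return 100.0 * INDEX_NORM.cdf(score)


def index_from_percentile(percentile: float) -> float:
    """Index score at a given percentile (must be strictly between 0 and 100)."""
    return INDEX_NORM.inv_cdf(percentile / 100.0)


if __name__ == "__main__":
    for p in (1, 50, 98):
        print(f"{p}th percentile ~ index score {index_from_percentile(p):.1f}")
```

Under this assumption, the 98th percentile sits at an index score of about 131 and the 1st percentile at about 65, so the gap described above spans more than four standard deviations.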

Key takeaway

For AI scientists and research scientists working on AGI, these findings highlight a critical imbalance: current scaling methods advance linguistic abilities disproportionately while visual-perceptual reasoning lags. Investigate architectural designs that foster integrated world representations and move beyond purely statistical pattern matching toward more balanced, human-like general intelligence. That means shifting focus from merely scaling data and compute to addressing fundamental cognitive bottlenecks.

Key insights

Generative AI models exhibit uneven cognitive development, excelling in language but struggling with visual-perceptual reasoning.

Principles

AI cognitive growth is asymmetric, not uniform.
Architectural biases favor language-based symbolic manipulation.

Method

A two-pronged psychometric approach was used: adapting WAIS-IV subtests for initial AI evaluation, then developing the scalable, AI-centric AIQ Benchmark with procedurally generated items to overcome human-normed ceiling effects.
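
As an illustration of what "procedurally generated items" can look like, the sketch below generates a seeded number-series item and renders the same item in two forms: a linguistic prompt and a text stand-in for a visual layout. This is not the AIQ Benchmark's actual generator; the item type, the names (`SeriesItem`, `make_item`, `render_text`, `render_visual`), and the difficulty scheme are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SeriesItem:
    """Hypothetical item: an arithmetic series with a hidden next term."""
    terms: Tuple[int, ...]  # terms shown to the model
    answer: int             # the next term, used for scoring


def make_item(rng: random.Random, length: int = 4, difficulty: int = 1) -> SeriesItem:
    """Build one item; larger `difficulty` allows larger step sizes."""
    start = rng.randint(1, 9)
    step = rng.randint(1, 3 * difficulty)
    seq = [start + i * step for i in range(length + 1)]
    return SeriesItem(terms=tuple(seq[:-1]), answer=seq[-1])


def render_text(item: SeriesItem) -> str:
    """Linguistic presentation of the item."""
    shown = ", ".join(str(t) for t in item.terms)
    return f"What number comes next? {shown}, ?"


def render_visual(item: SeriesItem) -> str:
    """Text stand-in for an image: one row of marks per term."""
    return "\n".join("*" * t for t in item.terms)


if __name__ == "__main__":
    item = make_item(random.Random(42), difficulty=2)
    print(render_text(item))
    print(render_visual(item))
```

Because items come from a seeded generator rather than a fixed, human-normed item bank, difficulty can be raised past the point where human-normed tests hit their ceiling, and the same underlying item can be scored in both modalities.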

In practice

Use the AIQ Benchmark for scalable assessment of AI cognition.
Prioritize multimodal training to develop more balanced intelligence.
Address the architectural biases that hold back visual reasoning.

Topics

Artificial General Intelligence
Cognitive Assessment
AIQ Benchmark
Multimodal AI Models
Perceptual Reasoning

Best for: AI Scientist, Research Scientist, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.