Uneven Evolution of Cognition Across Generations of Generative AI Models
Summary
A psychometric framework was introduced to evaluate the cognitive capabilities of generative AI models, comparing them to human norms and tracking their evolution across generations. Initial evaluations using tasks adapted from the Wechsler Adult Intelligence Scale (WAIS-IV) revealed that leading multimodal models, including OpenAI's GPT-4 Turbo and GPT-4o, Google's Gemini Flash 1.5 and Pro 1.5, and Anthropic's Claude 3 Opus and Claude 3.5 Sonnet, achieved near-ceiling performance in verbal comprehension and working memory (>98th percentile) but near-floor performance in perceptual reasoning (<1st percentile). To track development beyond human-normed limits, the Artificial Intelligence Quotient (AIQ) Benchmark was developed and applied to six generations of Gemini models and two model families, showing significant but asymmetric performance gains. Abstract quantitative reasoning improved much faster when presented linguistically than when presented visually, indicating an architectural bias towards language-based symbolic manipulation, while visual-perceptual organization remained largely stagnant.
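To make the percentile figures above concrete, here is a minimal sketch of how a standard score maps to a percentile. It assumes the common convention that index scores are scaled to a mean of 100 and a standard deviation of 15 and are normally distributed; the summary does not say how the study derived its percentiles, and the function name is hypothetical.

```python
from statistics import NormalDist


def percentile_from_standard_score(score: float,
                                   mean: float = 100.0,
                                   sd: float = 15.0) -> float:
    """Return the percentile rank (0-100) of a standard score.

    Assumption: scores are normally distributed with the given mean and
    standard deviation (100 and 15 are the usual index-score convention).
    """
    return 100.0 * NormalDist(mu=mean, sigma=sd).cdf(score)


if __name__ == "__main__":
    # Illustrative scores only; these are not results from the study.
    for score in (65, 100, 131):
        print(score, round(percentile_from_standard_score(score), 2))
```

Under these assumptions, a score a little more than two standard deviations above the mean lands above the 98th percentile, and one more than about 2.3 standard deviations below it falls under the 1st percentile.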
Key takeaway
For AI Scientists and Research Scientists focused on AGI development, these findings highlight a critical imbalance: current scaling methods disproportionately advance linguistic abilities while visual-perceptual reasoning lags. You should investigate novel architectural designs that foster integrated world representations, moving beyond purely statistical pattern matching to achieve more balanced, human-like general intelligence. This requires a shift in focus from merely scaling data and compute to addressing fundamental cognitive bottlenecks.
Key insights
Generative AI models exhibit uneven cognitive development, excelling in language but struggling with visual-perceptual reasoning.
Principles
- AI cognitive growth is asymmetric, not uniform.
- Architectural biases favor language-based symbolic manipulation.
Method
A two-pronged psychometric approach was used: adapting WAIS-IV subtests for initial AI evaluation, then developing the scalable, AI-centric AIQ Benchmark with procedurally generated items to overcome human-normed ceiling effects.
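As an illustration of what "procedurally generated items" can mean, the sketch below builds a simple number-series item from a seed and renders it as text. It is not the AIQ Benchmark's generator, which the summary does not describe; the `SeriesItem` type, the `make_series_item` function, the arithmetic rule, and the parameter ranges are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SeriesItem:
    prompt: str   # the item as shown to the model, in linguistic form
    answer: int   # the expected next term


def make_series_item(seed: int, length: int = 5) -> SeriesItem:
    """Generate one arithmetic-series item deterministically from a seed.

    Assumption: a fixed-step arithmetic rule stands in for whatever rules
    a real benchmark would use; difficulty could be varied via `length`
    or the step range.
    """
    rng = random.Random(seed)          # local RNG, so items are reproducible
    start = rng.randint(1, 20)
    step = rng.randint(2, 9)
    terms = [start + i * step for i in range(length + 1)]
    shown = ", ".join(str(t) for t in terms[:-1])
    prompt = f"What number comes next in the series: {shown}?"
    return SeriesItem(prompt=prompt, answer=terms[-1])


if __name__ == "__main__":
    item = make_series_item(seed=7)
    print(item.prompt)
    print(item.answer)
```

Because each item is derived from a seed, fresh items can be produced in any quantity and at adjustable difficulty, which is one way a benchmark can avoid the fixed item pool and ceiling of a human-normed test.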
In practice
- Use the AIQ Benchmark for scalable AI cognitive assessment.
- Prioritize multimodal training for balanced intelligence.
- Address architectural biases for visual reasoning.
Topics
- Artificial General Intelligence
- Cognitive Assessment
- AIQ Benchmark
- Multimodal AI Models
- Perceptual Reasoning
Best for: AI Scientist, Research Scientist, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.