AGI Is Not Multimodal

2025-06-04 · Source: The Gradient · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, long

Summary

The article argues that current generative AI models, despite their apparent intelligence, do not possess a true understanding of the physical world, which is crucial for Artificial General Intelligence (AGI). It posits that these models primarily learn "bags of heuristics" or "syntax" through next-token prediction, rather than inducing genuine world models. The author contends that the multimodal approach to AGI, which attempts to combine narrow, modality-specific models, is unlikely to succeed because it unnaturally severs deep connections between modalities and trains models to copy human conceptual structures rather than forming novel concepts. Instead, the article advocates for AGI approaches that prioritize embodiment and interaction with the environment, allowing modality-specific processing to emerge naturally, and emphasizes that AGI must be general across all domains, including physical reality.

Key takeaway

For AI researchers and scientists aiming to build AGI, your focus should shift from scaling multimodal models to developing systems grounded in embodied interaction. Recognize that current LLMs primarily learn syntax and heuristics, not true world models, which limits their capacity for physical reasoning. Prioritize designing architectures where modality-specific processing emerges from unified perception and action systems, fostering flexible cognitive ability over mere efficiency.

Key insights

True AGI requires embodied understanding and interaction with the physical world, not just symbolic manipulation.

Principles

AGI must be general across all domains.
Understanding the physical world is a prerequisite for AGI.
Scale alone does not guarantee AGI.

Method

Pursue approaches to intelligence that treat embodiment and interaction with the environment as primary and let modality-specific processing emerge on its own, rather than gluing separately trained modality models together.

In practice

Design systems in which modalities fuse naturally.
Process images, text, and video with the same perception system.
Use a single action system for both producing text and manipulating objects.
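
As a rough illustration of the points above, the following is a minimal Python sketch, not the article's implementation. Every name in it (`to_frames`, `perceive`, `Action`, `type_text`, `grasp`, `execute`) is hypothetical. The "encoder" is a stand-in that only averages pixel intensities, and text is "rendered" as one pixel per character. The point is only the shape of the interface: text, images, and video enter through one perception path, and typing and grasping leave through one action path.

```python
from dataclasses import dataclass
from typing import List, Union

Frame = List[List[float]]                      # one grayscale frame, pixels in [0, 1]
Observation = Union[str, Frame, List[Frame]]   # text, a single image, or a video


def text_to_frame(text: str, width: int = 8) -> Frame:
    """Toy stand-in for rendering text to pixels: one pixel per character."""
    pixels = [min(ord(ch), 255) / 255.0 for ch in text]
    if not pixels:
        pixels = [0.0]
    while len(pixels) % width:                 # pad the last row with blank pixels
        pixels.append(0.0)
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


def to_frames(obs: Observation) -> List[Frame]:
    """Convert any supported observation into a list of frames."""
    if isinstance(obs, str):
        return [text_to_frame(obs)]            # text is seen, not tokenized
    if isinstance(obs[0][0], (int, float)):
        return [obs]                           # a single image is a one-frame video
    return list(obs)                           # already a sequence of frames


def perceive(obs: Observation) -> List[float]:
    """Single perception path (assumed placeholder encoder: mean intensity per frame)."""
    features = []
    for frame in to_frames(obs):
        values = [p for row in frame for p in row]
        features.append(sum(values) / len(values))
    return features


@dataclass(frozen=True)
class Action:
    """One motor command; key presses and gripper moves share this type."""
    effector: str   # hypothetical effector name, e.g. "keyboard" or "gripper"
    command: str


def type_text(text: str) -> List[Action]:
    """Producing text is expressed as a sequence of key presses."""
    return [Action("keyboard", f"press:{ch}") for ch in text]


def grasp(obj: str) -> List[Action]:
    """Manipulating an object uses the same Action type."""
    return [Action("gripper", f"reach:{obj}"), Action("gripper", "close")]


def execute(actions: List[Action]) -> List[str]:
    """Single action path: every command goes through the same dispatcher."""
    return [f"{a.effector}<-{a.command}" for a in actions]
```

In this sketch, `perceive("hi")`, `perceive(image)`, and `perceive(video)` all return the same kind of feature list, and `execute(type_text("ok") + grasp("cup"))` runs text output and object manipulation through one loop. A real system would replace the placeholder encoder and dispatcher with learned components.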

Topics

Artificial General Intelligence
Embodied AI
Large Language Models
Multimodal AI
World Models

Best for: AI Researcher, AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by The Gradient.