AGI Is Not Multimodal
Summary
The article argues that current generative AI models, despite their apparent intelligence, do not possess a true understanding of the physical world, which is crucial for Artificial General Intelligence (AGI). It posits that these models primarily learn "bags of heuristics" or "syntax" through next-token prediction, rather than inducing genuine world models. The author contends that the multimodal approach to AGI, which attempts to combine narrow, modality-specific models, is unlikely to succeed because it unnaturally severs deep connections between modalities and trains models to copy human conceptual structures rather than forming novel concepts. Instead, the article advocates for AGI approaches that prioritize embodiment and interaction with the environment, allowing modality-specific processing to emerge naturally, and emphasizes that AGI must be general across all domains, including physical reality.
Key takeaway
For AI researchers and scientists aiming to build AGI, your focus should shift from scaling multimodal models to developing systems grounded in embodied interaction. Recognize that current LLMs primarily learn syntax and heuristics, not true world models, which limits their capacity for physical reasoning. Prioritize designing architectures where modality-specific processing emerges from unified perception and action systems, fostering flexible cognitive ability over mere efficiency.
Key insights
True AGI requires embodied understanding and interaction with the physical world, not just symbolic manipulation.
Principles
- AGI must be general across all domains.
- Understanding the physical world is a prerequisite for AGI.
- Scale alone does not guarantee AGI.
Method
Pursue approaches to intelligence that treat embodiment and interaction with the environment as primary, so that modality-specific processing emerges on its own, rather than gluing separately trained modalities together.
In practice
- Design systems where modalities naturally fuse.
- Process images, text, and video with the same perception system.
- Use a single action system for both producing text and manipulating objects.
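
As a rough illustration of the points above, the following Python sketch routes text, an image, and a video through one perception function and expresses both typing and object manipulation in one action format. It is a toy under stated assumptions, not the article's method: every name here (`text_to_frame`, `perceive`, `Action`, `type_char`, `move_gripper`) is hypothetical, and laying UTF-8 bytes out on a grid is only a stand-in for rendering text as pixels.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Frame = List[List[int]]  # a 2D grid of intensity values in 0..255


def text_to_frame(text: str, width: int = 8) -> Frame:
    """Assumed stand-in for rendering text to pixels: place UTF-8 bytes on a grid."""
    data = list(text.encode("utf-8"))
    data += [0] * (-len(data) % width)  # zero-pad so the last row is full
    return [data[i:i + width] for i in range(0, len(data), width)]


def perceive(frames: Sequence[Frame], patch: int = 4) -> List[Tuple[int, ...]]:
    """One perception path for every input: cut each row into fixed-size strips."""
    tokens = []
    for frame in frames:
        for row in frame:
            for c in range(0, len(row) - patch + 1, patch):
                tokens.append(tuple(row[c:c + patch]))
    return tokens


@dataclass
class Action:
    """One action format: a position plus a value to apply there."""
    x: int
    y: int
    value: int


def type_char(ch: str, cursor: int, width: int = 8) -> Action:
    # Writing text is treated as placing a character code at the cursor position.
    return Action(x=cursor % width, y=cursor // width, value=ord(ch))


def move_gripper(x: int, y: int, force: int) -> Action:
    # Manipulating an object uses the same format: a position and a grip force.
    return Action(x=x, y=y, value=force)


image: Frame = [[0] * 8, [255] * 8]
video = [image, image]  # a video is just several frames

text_tokens = perceive([text_to_frame("hello")])
image_tokens = perceive([image])
video_tokens = perceive(video)
```

All three inputs come out as the same kind of token, so nothing downstream needs to know which modality it came from. A real system would learn this perception stage rather than hard-code it.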
Topics
- Artificial General Intelligence
- Embodied AI
- Large Language Models
- Multimodal AI
- World Models
Best for: AI Researcher, AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by The Gradient.