Nobody gets this right

2026-06-07 · Source: David Shapiro · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, long

Summary

The article challenges the common claim that large language models (LLMs) cannot function as "world models" because "the world isn't made of words." The author argues this view is outdated: contemporary LLMs are increasingly multimodal, trained on audio, video, images, and text, which makes "omni models" a more accurate descriptor. The author rebuts several objections to treating LLMs as world models, including that sensor data is too unpredictable, that pixels cannot be predicted the way tokens are, and that "generation" and "understanding" are distinct in action-conditioned models, a distinction the author regards as merely semantic. The author also dismisses the idea that future AI companies will train world models exclusively on sensor data, pointing out that cognitive architectures for integrating multiple data streams have existed since the 1970s in autonomous systems such as rockets.

Key takeaway

AI architects and machine learning engineers evaluating advanced AI capabilities should recognize that the "world model" debate is shifting. Modern "omni models" integrate multimodal data, which undercuts the assumption that language models are limited to words. Focus on a model's predictive accuracy across diverse data types, which the author treats as evidence of abstract understanding. Avoid outdated distinctions between generation and understanding, and draw on established cognitive architecture principles when unifying complex sensor inputs in autonomous systems.

Key insights

The distinction between language models and world models is narrowing as AI systems become multimodal and better at forming abstract mathematical representations.

Principles

Modern LLMs are multimodal "omni models."
Prediction accuracy indicates abstract understanding.
Cognitive architectures unify diverse data streams.

In practice

Consider multimodal AI for complex tasks.
Evaluate AI by predictive accuracy, not input type.
Integrate diverse sensor data using cognitive architectures.
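
One way to read "evaluate by predictive accuracy, not input type" is to score every data stream with the same metric and compare it against a naive baseline. The sketch below is illustrative only and is not taken from the original article: the stream names, the numbers, and the choice of a mean-squared-error skill score are all assumptions made for the example.

```python
from typing import Dict, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty sequences."""
    if len(pred) != len(target) or len(target) == 0:
        raise ValueError("pred and target must be non-empty and the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def skill_score(pred: Sequence[float],
                target: Sequence[float],
                baseline: Sequence[float]) -> float:
    """1.0 means perfect prediction, 0.0 means no better than the baseline."""
    base_error = mse(baseline, target)
    if base_error == 0:
        raise ValueError("baseline is already perfect; skill score is undefined")
    return 1.0 - mse(pred, target) / base_error


# Hypothetical, made-up numbers for two different input types.
# The same scoring function is applied regardless of modality.
streams: Dict[str, Dict[str, Sequence[float]]] = {
    "audio_level": {
        "target": [1.0, 2.0, 3.0],
        "model": [1.0, 2.0, 2.0],
        "baseline": [0.0, 1.0, 2.0],
    },
    "pixel_intensity": {
        "target": [4.0, 4.0],
        "model": [4.0, 4.0],
        "baseline": [2.0, 6.0],
    },
}

scores = {
    name: skill_score(s["model"], s["target"], s["baseline"])
    for name, s in streams.items()
}

for name, score in scores.items():
    print(f"{name}: skill = {score:.3f}")
```

Normalizing against a baseline keeps scores comparable across streams with different units, so the comparison depends on how well the model predicts rather than on what kind of data it was given.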

Topics

World Models
Multimodal AI
Omni Models
Cognitive Architectures
Generative AI
Sensor Data Integration

Best for: AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by David Shapiro.