Nobody gets this right

Source: David Shapiro · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, long

Summary

The article challenges the common claim that large language models (LLMs) cannot function as "world models" because "the world isn't made of words." The author argues this view is outdated: contemporary LLMs are increasingly multimodal, trained on audio, video, images, and text, which makes "omni models" a more accurate descriptor. The author rebuts several arguments against treating LLMs as world models: that sensor data is too unpredictable to model, that pixels cannot be predicted the way tokens are, and that a meaningful line separates "generation" from "understanding" in action-conditioned models, which the author treats as a merely semantic distinction. The author also dismisses the idea that future AI companies will train world models exclusively on sensor data, pointing out that cognitive architectures for integrating multiple data streams have existed since the 1970s in autonomous systems such as rockets.

Key takeaway

AI architects and machine learning engineers evaluating advanced AI capabilities should recognize that the "world model" debate is evolving. Modern "omni models" integrate multimodal data, which undercuts the words-only objection aimed at traditional language models. Focus on a model's predictive accuracy across diverse data types, which the article treats as evidence of abstract understanding. Avoid outdated distinctions between generation and understanding, and consider established cognitive architecture principles when unifying complex sensor inputs in autonomous systems.

Key insights

The distinction between language models and world models is diminishing as AI becomes multimodal and adept at abstract mathematical representations.

Best for: AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by David Shapiro.