LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
Summary
LeWorldModel (LeWM) is a novel Joint Embedding Predictive Architecture (JEPA) designed for stable, end-to-end learning of world models directly from raw pixels. Unlike existing JEPA methods that often rely on complex multi-term losses or pre-trained encoders, LeWM achieves stability using only two loss terms: a next-embedding prediction loss and a Sketched-Isotropic-Gaussian Regularizer (SIGReg) to prevent representation collapse. This approach reduces tunable hyperparameters from six to one. With 15 million parameters, LeWM can be trained on a single GPU in a few hours, enabling planning up to 48 times faster than foundation-model-based world models while maintaining competitive performance across diverse 2D and 3D control tasks. Its latent space also effectively encodes physical structure and reliably detects physically implausible events.
Key takeaway
For Machine Learning Engineers developing embodied AI agents, LeWorldModel offers a robust and efficient approach to learning world models. Its simplified two-term objective and single tunable hyperparameter significantly reduce training complexity and instability compared to prior JEPA methods. You should consider LeWM for applications requiring fast, competitive planning from raw pixel inputs, especially when computational resources are limited to a single GPU.
Key insights
LeWorldModel offers a stable, end-to-end JEPA for world modeling from pixels using a simple two-term loss to prevent representation collapse.
Principles
- Stable JEPA training is possible with minimal loss terms.
- Regularizing latent embeddings toward an isotropic Gaussian distribution prevents representation collapse.
- Simpler loss objectives enhance training stability and efficiency.
Method
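LeWM jointly optimizes a Vision Transformer encoder and a Transformer predictor. Training combines a mean squared error (MSE) loss on the predicted next embedding with SIGReg, which pushes the latent distribution toward an isotropic Gaussian. At planning time, the learned model drives model predictive control (MPC) using the cross-entropy method (CEM).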
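The two-term objective can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: `gaussian_sketch_penalty` is a hypothetical stand-in that matches only the mean and variance of random 1D projections, whereas the actual SIGReg statistic may differ, and `reg_weight` is an assumed name for the single tunable hyperparameter. The example shows why the prediction loss alone is not enough: fully collapsed embeddings are perfectly predictable, yet the regularizer penalizes them.

```python
import numpy as np

def prediction_loss(pred_next, target_next):
    """Mean squared error between predicted and actual next embeddings."""
    return float(np.mean((pred_next - target_next) ** 2))

def gaussian_sketch_penalty(z, num_directions=64, seed=0):
    """Hypothetical stand-in for SIGReg (assumption, not the paper's statistic).

    Projects embeddings z of shape (batch, dim) onto random unit directions
    and penalizes each 1D projection for deviating from mean 0, variance 1.
    """
    rng = np.random.default_rng(seed)
    directions = rng.standard_normal((z.shape[1], num_directions))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    proj = z @ directions  # shape: (batch, num_directions)
    mean_term = proj.mean(axis=0) ** 2
    var_term = (proj.var(axis=0) - 1.0) ** 2
    return float(np.mean(mean_term + var_term))

def two_term_loss(pred_next, target_next, z, reg_weight=1.0):
    """Prediction loss plus weighted regularizer; reg_weight is an assumed name."""
    return prediction_loss(pred_next, target_next) + reg_weight * gaussian_sketch_penalty(z)

rng = np.random.default_rng(42)
healthy = rng.standard_normal((512, 16))  # roughly isotropic Gaussian embeddings
collapsed = np.zeros((512, 16))           # every input maps to the same point

# Assume a perfect predictor in both cases, so the prediction term is zero.
healthy_total = two_term_loss(healthy, healthy, healthy)
collapsed_total = two_term_loss(collapsed, collapsed, collapsed)
print(f"healthy: {healthy_total:.4f}  collapsed: {collapsed_total:.4f}")
```

In this toy setup the collapsed embeddings score zero on the prediction term but incur a large regularization penalty, while the Gaussian-like embeddings score close to zero on both terms.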
In practice
- Train a 15M-parameter world model on a single GPU in a few hours.
- Plan up to 48x faster than foundation-model-based world models on continuous control tasks.
- Use SIGReg, which leaves a single tunable hyperparameter, to keep training stable.
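The planning step (MPC with CEM) can be sketched in the same spirit. Everything below is illustrative: `toy_dynamics` is a hypothetical stand-in for the learned predictor, the cost is an assumed squared distance to a goal embedding, and the population and elite sizes are arbitrary choices rather than values from the paper.

```python
import numpy as np

def cem_plan(z0, z_goal, dynamics, horizon=5, action_dim=2,
             population=256, num_elites=32, iterations=15, seed=0):
    """Cross-entropy method over action sequences, rolled out in latent space."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # Sample candidate action sequences around the current distribution.
        noise = rng.standard_normal((population, horizon, action_dim))
        actions = mean + std * noise
        # Roll every candidate forward through the latent dynamics.
        z = np.repeat(z0[None, :], population, axis=0)
        for t in range(horizon):
            z = dynamics(z, actions[:, t])
        # Score by squared distance to the goal embedding (assumed cost).
        costs = np.sum((z - z_goal) ** 2, axis=1)
        elites = actions[np.argsort(costs)[:num_elites]]
        # Refit the sampling distribution to the best candidates.
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6
    return mean

def toy_dynamics(z, a):
    """Hypothetical predictor: the latent state moves by the action."""
    return z + a

z0 = np.zeros(2)
z_goal = np.array([1.0, -2.0])
plan = cem_plan(z0, z_goal, toy_dynamics)
final_z = z0 + plan.sum(axis=0)
goal_error = float(np.linalg.norm(final_z - z_goal))
print("first action:", plan[0], "goal error:", goal_error)
```

In an MPC loop, only the first action of the returned plan would typically be executed before replanning from the next observation.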
Topics
- Joint Embedding Predictive Architectures
- World Models
- Latent Space Learning
- Reinforcement Learning
- Model Predictive Control
- Representation Learning
- Vision Transformers
Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.