LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

2022-06-27 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision · Depth: Expert, extended

Summary

LeWorldModel (LeWM) is a novel Joint Embedding Predictive Architecture (JEPA) designed for stable, end-to-end learning of world models directly from raw pixels. Unlike existing JEPA methods that often rely on complex multi-term losses or pre-trained encoders, LeWM achieves stability using only two loss terms: a next-embedding prediction loss and a Sketched-Isotropic-Gaussian Regularizer (SIGReg) to prevent representation collapse. This approach reduces tunable hyperparameters from six to one. With 15 million parameters, LeWM can be trained on a single GPU in a few hours, enabling planning up to 48 times faster than foundation-model-based world models while maintaining competitive performance across diverse 2D and 3D control tasks. Its latent space also effectively encodes physical structure and reliably detects physically implausible events.

Key takeaway

For machine learning engineers developing embodied AI agents, LeWorldModel offers a robust and efficient way to learn world models. Its two-term objective and single tunable hyperparameter reduce training complexity and instability relative to prior JEPA methods. Consider LeWM for applications that need fast, competitive planning from raw pixel inputs, especially when compute is limited to a single GPU.

Key insights

LeWorldModel offers a stable, end-to-end JEPA for world modeling from pixels using a simple two-term loss to prevent representation collapse.

Principles

Stable JEPA training is possible with minimal loss terms.
Regularizing latent embeddings toward an isotropic Gaussian distribution prevents representation collapse.
Simpler loss objectives enhance training stability and efficiency.
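
To make the two-term idea concrete, here is a minimal NumPy sketch of a prediction loss combined with a Gaussian-matching penalty. It is not the SIGReg statistic from the paper: as a simplified stand-in, it projects embeddings onto random unit directions and matches only the mean and variance of each 1D projection to a standard normal. The function names, `num_directions`, and the weight `lam` are assumptions made for illustration.

```python
import numpy as np

def sliced_gaussian_penalty(z, num_directions=64, seed=0):
    """Small when z (shape N x D) looks like N(0, I) along random 1D projections.

    Simplified stand-in for a sketched isotropic-Gaussian regularizer:
    only the first two moments of each projection are matched.
    """
    rng = np.random.default_rng(seed)
    d = z.shape[1]
    # Random unit directions: a cheap "sketch" of the D-dimensional distribution.
    dirs = rng.standard_normal((d, num_directions))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    proj = z @ dirs  # shape (N, num_directions)
    # Under N(0, I), every projection has mean 0 and variance 1.
    mean = proj.mean(axis=0)
    var = proj.var(axis=0)
    return float(np.mean(mean ** 2 + (var - 1.0) ** 2))

def two_term_loss(pred_next, target_next, embeddings, lam=1.0):
    """Next-embedding MSE plus lam times the Gaussian-matching penalty."""
    prediction = float(np.mean((pred_next - target_next) ** 2))
    return prediction + lam * sliced_gaussian_penalty(embeddings)

rng = np.random.default_rng(1)
gaussian_z = rng.standard_normal((4096, 16))   # well-spread embeddings
collapsed_z = np.full((4096, 16), 0.3)         # every input maps to one point

penalty_gaussian = sliced_gaussian_penalty(gaussian_z)
penalty_collapsed = sliced_gaussian_penalty(collapsed_z)
```

In this toy setting the collapsed embeddings have zero variance along every direction, so they incur a much larger penalty than the Gaussian ones. That is the mechanism by which such a regularizer discourages the trivial solution that an MSE prediction loss alone would accept.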

Method

LeWM jointly optimizes a Vision Transformer encoder and a Transformer predictor. Training uses two terms: an MSE loss on the predicted next embedding, and SIGReg, which pushes the latent distribution toward an isotropic Gaussian. For planning, the trained model is used inside model predictive control (MPC) with the cross-entropy method (CEM).
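
The planning step can be illustrated with a generic CEM loop. The sketch below is not LeWM's planner: `toy_predictor` is a hypothetical linear stand-in for a learned predictor, the cost is an assumed squared distance to a goal embedding, and all sizes (`horizon`, `population`, `num_elites`, `iterations`) are illustrative defaults.

```python
import numpy as np

def cem_plan(z0, z_goal, predictor, horizon=8, action_dim=2,
             population=256, num_elites=32, iterations=5, seed=0):
    """Cross-entropy method over action sequences, rolled out in latent space."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # Sample candidate action sequences from the current Gaussian.
        noise = rng.standard_normal((population, horizon, action_dim))
        actions = mean + std * noise
        # Roll out every candidate with the predictor, starting from z0.
        z = np.repeat(z0[None, :], population, axis=0)
        for t in range(horizon):
            z = predictor(z, actions[:, t])
        # Assumed cost: squared distance of the final embedding to the goal.
        cost = np.sum((z - z_goal) ** 2, axis=1)
        # Refit the sampling distribution to the lowest-cost candidates.
        elites = actions[np.argsort(cost)[:num_elites]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6
    return mean

def toy_predictor(z, a):
    # Hypothetical dynamics standing in for a learned predictor.
    return z + 0.5 * a

z0 = np.zeros(2)
z_goal = np.array([1.0, -1.0])
plan = cem_plan(z0, z_goal, toy_predictor)
first_action = plan[0]  # MPC executes only this action, then replans
```

In an MPC loop, only `first_action` would be applied to the environment; the next observation is then encoded and `cem_plan` is called again. Because every rollout happens on compact embeddings rather than pixels, the cost of each planning call is dominated by the predictor, which is where a small model pays off.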

In practice

Train a 15M-parameter world model on a single GPU in a few hours.
Plan up to 48x faster than foundation-model-based world models on control tasks.
Stabilize training with SIGReg, which leaves a single tunable hyperparameter.

Topics

Joint Embedding Predictive Architectures
World Models
Latent Space Learning
Reinforcement Learning
Model Predictive Control
Representation Learning
Vision Transformers

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.