Latent-WAM: Latent World Action Modeling for End-to-End Autonomous Driving

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Advanced, quick

Summary

Latent-WAM is an end-to-end autonomous driving framework that plans trajectories efficiently from spatially aware, dynamics-informed latent world representations. It targets three limitations of existing world-model-based planners: inadequate representation compression, limited spatial understanding, and underused temporal dynamics, particularly under data and compute constraints. Its Spatial-Aware Compressive World Encoder (SCWE) distills geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens. Its Dynamic Latent World Model (DLWM) uses a causal Transformer to predict future world states autoregressively from historical visual and motion data. Latent-WAM achieves state-of-the-art results on NAVSIM v2 (89.3 EPDMS) and HUGSIM (28.9 HD-Score), outperforming prior perception-free methods while using a 104M-parameter model and less training data.

Key takeaway

For research scientists developing autonomous driving systems, Latent-WAM shows that planning performance can improve even with reduced data and compute. Its architecture pairs a Spatial-Aware Compressive World Encoder with a Dynamic Latent World Model. Consider applying similar spatially aware compression and causal temporal modeling in your own end-to-end driving frameworks, especially when operating under resource constraints.

Key insights

Latent-WAM improves autonomous driving planning via efficient, spatially-aware, and dynamics-informed latent world models.

Principles

Method

Latent-WAM uses a Spatial-Aware Compressive World Encoder (SCWE) for image compression and a Dynamic Latent World Model (DLWM) with a causal Transformer for autoregressive future state prediction.
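
To make "causal Transformer for autoregressive future state prediction" concrete, here is a minimal pure-Python sketch of the two ideas involved: a causal attention step in which each timestep sees only itself and earlier timesteps, and a rollout loop that feeds each prediction back in as input. This is an illustration under simplifying assumptions and not the paper's implementation. It treats each timestep as a single latent vector, uses identity projections instead of learned weights, and omits the motion inputs. The names `causal_attention` and `rollout` are hypothetical.

```python
import math


def causal_attention(seq):
    """Single-head self-attention with a causal mask.

    Assumption for brevity: queries, keys, and values are the inputs
    themselves (no learned projections). Step t attends to steps <= t only.
    """
    dim = len(seq[0])
    out = []
    for t, query in enumerate(seq):
        # Score the query against past and present steps; future steps
        # are never visited, which is what the causal mask enforces.
        scores = [
            sum(a * b for a, b in zip(query, seq[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible steps.
        peak = max(scores)
        weights = [math.exp(x - peak) for x in scores]
        total = sum(weights)
        out.append([
            sum(w / total * seq[s][d] for s, w in enumerate(weights))
            for d in range(dim)
        ])
    return out


def rollout(history, horizon):
    """Autoregressively predict `horizon` future latent states.

    Each predicted state is appended to the sequence and becomes part of
    the context for the next prediction.
    """
    states = [list(s) for s in history]
    for _ in range(horizon):
        next_state = causal_attention(states)[-1]  # read from the last position
        states.append(next_state)
    return states[len(history):]
```

For example, `rollout([[1.0, 0.0], [0.0, 1.0]], 3)` returns three predicted two-dimensional states. Because of the causal mask, changing a later input never alters the outputs at earlier positions, which is the property that lets such a model be trained on full sequences and then run step by step at inference time.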

In practice

Best for: Research Scientist, AI Researcher, AI Scientist, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.