Diffusion Transformer World-Action Model for AV Scene Prediction

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

This work presents a Diffusion Transformer (DiT) world-action model for autonomous driving scene prediction. From current front-camera latents and ego-actions, it generates future scene latents up to 8 seconds ahead, which a frozen decoder renders to 256×256 frames. Evaluated on 150 held-out nuScenes scenes, the study also benchmarks visual encoders and shows that V-JEPA2 with temporal context reduces steering RMSE by 40% relative to single-frame encoders. The DiT model's effectiveness relies on four key ingredients: spatial tokens, the x_{0} prediction objective, residual anchoring, and sampling matched to target uncertainty. It achieves a KID of 0.078 versus 0.375 for direct regression (about 4.8× lower), and it addresses the perception-distortion tradeoff with a deployable, train-derived calibration. The model exhibits strong action controllability (Spearman ρ=0.81), and a compact 1.7M-parameter "jump" model recovers the full ground-truth motion magnitude (1.02× GT).

Key takeaway

Machine learning engineers developing autonomous driving world models should prioritize distribution metrics such as FID/KID over distortion metrics (e.g., CosSim, SSIM) when assessing model realism. Use diffusion models with residual anchoring and an x_{0} prediction objective, and consider the "jump" model reparameterization for better temporal motion fidelity and action controllability.
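The gap between distortion and distribution metrics can be illustrated with a small synthetic experiment. The sketch below is not the paper's evaluation code. It assumes random Gaussian vectors as stand-ins for image features (real KID is computed on Inception features and usually averaged over subsets), and all names (`kid`, `mean_cosine`, `regressed`, `sampled`) are hypothetical. The KID estimate is the standard unbiased MMD² with a cubic polynomial kernel.

```python
import numpy as np

def polynomial_kernel(a, b):
    """Cubic polynomial kernel commonly used for KID: (a.b / d + 1)^3."""
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3

def kid(real, fake):
    """Unbiased MMD^2 estimate between two feature sets (rows are samples)."""
    m, n = len(real), len(fake)
    k_rr = polynomial_kernel(real, real)
    k_ff = polynomial_kernel(fake, fake)
    k_rf = polynomial_kernel(real, fake)
    # Drop diagonal terms so the within-set averages are unbiased.
    within_real = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    within_fake = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    return float(within_real + within_fake - 2.0 * k_rf.mean())

def mean_cosine(a, b):
    """Average per-sample cosine similarity (a distortion-style metric)."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))

# Synthetic stand-in features (assumption: not real image features).
rng = np.random.default_rng(0)
n, d = 500, 16
signal = rng.standard_normal((n, d))             # predictable part of the future
target = signal + rng.standard_normal((n, d))    # true future: signal + unpredictable part
regressed = signal                               # point estimate equal to the conditional mean
sampled = signal + rng.standard_normal((n, d))   # one draw from the correct conditional
# Independent set of "real" futures used as the KID reference.
reference = rng.standard_normal((n, d)) + rng.standard_normal((n, d))

cos_reg, cos_samp = mean_cosine(target, regressed), mean_cosine(target, sampled)
kid_reg, kid_samp = kid(reference, regressed), kid(reference, sampled)
print(f"CosSim  regression={cos_reg:.3f}  sampling={cos_samp:.3f}")
print(f"KID     regression={kid_reg:.3f}  sampling={kid_samp:.3f}")
```

In this toy setup the regressed prediction scores higher on cosine similarity, yet its feature distribution is too narrow, so its KID is clearly worse than that of the sampled prediction. This is the pattern the takeaway warns about when distortion metrics alone are used to judge realism.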

Key insights

Diffusion models, evaluated with distribution metrics, offer superior perceptual realism and action controllability for AV world models.
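Action controllability is reported in the summary as a Spearman rank correlation (ρ=0.81). The source does not say which response is correlated with the commanded action, so the sketch below uses hypothetical quantities: a list of commanded steering values and a scalar response measured from each predicted rollout. The rank computation assumes no tied values.

```python
import numpy as np

def ranks(values):
    """Rank positions 0..n-1 (assumes no ties in `values`)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

# Hypothetical data: commanded steering and a measured scene response
# (for example, mean horizontal shift in the predicted frames).
commanded_steering = [-0.4, -0.2, 0.0, 0.1, 0.3, 0.5]
measured_response = [-1.9, -1.2, 0.1, -0.3, 1.4, 2.2]
rho = spearman_rho(commanded_steering, measured_response)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 means the predicted scene responds monotonically to the commanded action, which is what a controllable world model should show. Rank correlation is used here because it does not require the response to be linear in the action.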

Principles

Distribution metrics (FID/KID) are crucial for evaluating AV world model realism.
Temporal context in visual encoders significantly improves ego-action prediction.
Residual anchoring and the x_{0} objective are vital for DiT performance in compact latents.

Method

A latent Diffusion Transformer (DiT) predicts future scene latents from current camera latents and ego-actions, using residual anchoring and an x_{0}-prediction objective, then decodes to 256×256 frames.
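The training objective described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: "residual anchoring" is read here as having the network predict the change relative to the current latent, so that the clean-latent estimate is `z_current + residual`. The noise schedule is a standard linear-beta DDPM schedule, the latent shape is (batch, tokens, channels), and `denoiser` is a hypothetical callable standing in for the DiT.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal-retention coefficients for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Forward diffusion: z_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    a = alpha_bar[t].reshape(-1, 1, 1)  # broadcast over tokens and channels
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise

def x0_loss_with_anchor(denoiser, z_future, z_current, action, alpha_bar, rng):
    """x0-prediction MSE where the network output is a residual on z_current.

    Assumption: this residual form is one plausible reading of
    "residual anchoring"; the source does not define it.
    """
    batch = z_future.shape[0]
    t = rng.integers(0, len(alpha_bar), size=batch)  # one timestep per sample
    z_t = q_sample(z_future, t, alpha_bar, rng)      # noised future latent
    residual = denoiser(z_t, t, z_current, action)   # hypothetical DiT call
    x0_hat = z_current + residual                    # anchor on the current latent
    return float(np.mean((x0_hat - z_future) ** 2))  # regress the clean latent directly

# Toy usage with a placeholder denoiser that predicts "no change".
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z_current = rng.standard_normal((4, 8, 16))
z_future = z_current + 0.1 * rng.standard_normal((4, 8, 16))
action = rng.standard_normal((4, 2))
zero_denoiser = lambda z_t, t, z_c, a: np.zeros_like(z_c)
loss = x0_loss_with_anchor(zero_denoiser, z_future, z_current, action, alpha_bar, rng)
print(f"loss with zero residual: {loss:.4f}")
```

Under this reading, a network that outputs zero already reproduces the current scene, so the model only has to learn how the scene changes. That may be part of why anchoring helps when the latent space is compact.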

In practice

Use V-JEPA2 with temporal context for improved steering RMSE.
Apply the train-derived calibration so the diffusion model's advantages remain deployable.
Implement a "jump" model for recovering coherent forward motion.
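The summary reports that the "jump" model recovers 1.02× the ground-truth motion magnitude. The source does not state how that ratio is measured, so the sketch below shows one possible definition as an assumption: the mean displacement of predicted latents from the current latent, divided by the same quantity for the ground truth. The function name and array shapes are hypothetical.

```python
import numpy as np

def motion_magnitude_ratio(z_current, z_pred, z_gt):
    """Mean predicted displacement divided by mean ground-truth displacement.

    Displacement is the per-token L2 distance from the current latent.
    A ratio near 1 means the prediction moves about as much as reality;
    a ratio well below 1 suggests under-predicted motion.
    """
    pred_motion = np.linalg.norm(z_pred - z_current, axis=-1).mean()
    gt_motion = np.linalg.norm(z_gt - z_current, axis=-1).mean()
    return float(pred_motion / gt_motion)

# Toy latents with shape (batch, tokens, channels).
rng = np.random.default_rng(0)
z_current = rng.standard_normal((4, 8, 16))
z_gt = z_current + rng.standard_normal((4, 8, 16))
z_half = z_current + 0.5 * (z_gt - z_current)  # prediction that moves half as far

print(motion_magnitude_ratio(z_current, z_gt, z_gt))
print(motion_magnitude_ratio(z_current, z_half, z_gt))
```

A ratio like this only measures how far the prediction moves, not whether it moves in the right direction, so it is best read alongside a distribution metric and an action-controllability check.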

Topics

Autonomous Driving
World Models
Diffusion Transformers
Scene Prediction
Perception-Distortion Tradeoff
Action Controllability

Code references

dlcv-team/latent-world-models-av

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.