Diffusion Transformer World-Action Model for AV Scene Prediction

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, extended

Summary

The paper presents a Diffusion Transformer (DiT) world-action model for autonomous-driving scene prediction. Given current front-camera latents and ego-actions, it generates future scene latents up to 8 seconds ahead, which a frozen decoder renders to 256×256 frames. Evaluated on 150 held-out nuScenes scenes, the work also benchmarks visual encoders and finds that V-JEPA2 with temporal context reduces steering RMSE by 40% relative to single-frame encoders. The DiT's effectiveness rests on four ingredients: spatial tokens, an x_{0}-prediction objective, residual anchoring, and sampling matched to target uncertainty. It reaches a KID of 0.078 versus 0.375 for direct regression (about 4.8× lower), and it addresses the perception-distortion tradeoff with a deployable calibration derived from training data. The model shows strong action controllability (Spearman ρ = 0.81), and a compact 1.7M-parameter "jump" model recovers the full ground-truth motion magnitude (1.02× GT).
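
The action-controllability figure above is a Spearman rank correlation. As a rough illustration of what such a number measures, the sketch below computes Spearman ρ between commanded steering values and the motion measured in predicted frames. The arrays `commanded` and `measured` are made-up illustrative values, not data from the paper, and the helper names are hypothetical; the paper's actual evaluation protocol is not described in this summary.

```python
import numpy as np

def rank(values):
    """Return ranks 0..n-1 by ascending value (assumes no ties)."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = rank(np.asarray(x)), rank(np.asarray(y))
    return float(np.corrcoef(rx, ry)[0, 1])

# Illustrative values only: commanded steering vs. motion measured in predictions.
commanded = np.array([-0.4, -0.2, -0.1, 0.0, 0.1, 0.3, 0.5])
measured = np.array([-0.9, -0.3, -0.4, 0.0, 0.2, 0.1, 1.1])
rho = spearman_rho(commanded, measured)
```

Because the statistic depends only on ranks, it rewards predictions whose motion increases monotonically with the commanded action, even when the relationship is not linear.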

Key takeaway

For machine learning engineers building autonomous-driving world models: prioritize distribution metrics such as FID/KID over distortion metrics (e.g., CosSim, SSIM) when assessing realism. Use diffusion models with residual anchoring and an x_{0}-prediction objective, and consider the "jump" model reparameterization for better temporal motion fidelity and action controllability.
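
To make the distribution-metric recommendation concrete, here is a minimal sketch of KID in its standard form: an unbiased squared-MMD estimate with a cubic polynomial kernel. In practice the inputs would be Inception features of real and generated frames; here random vectors stand in for them, and the function names are hypothetical. Whether the paper uses exactly this estimator (for example with subset averaging) is not stated in this summary.

```python
import numpy as np

def polynomial_kernel(a, b):
    """Cubic polynomial kernel k(x, y) = (x.y / d + 1)^3 from the usual KID definition."""
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3

def kid_unbiased(real_feats, gen_feats):
    """Unbiased MMD^2 estimate between two feature sets (rows are samples)."""
    m, n = len(real_feats), len(gen_feats)
    k_rr = polynomial_kernel(real_feats, real_feats)
    k_gg = polynomial_kernel(gen_feats, gen_feats)
    k_rg = polynomial_kernel(real_feats, gen_feats)
    # Exclude diagonal (self-similarity) terms so the within-set averages are unbiased.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_gg = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
    return float(term_rr + term_gg - 2.0 * k_rg.mean())

# Stand-in features: random vectors instead of real Inception activations.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 16))
same = rng.normal(size=(200, 16))                # same distribution as `real`
shifted = rng.normal(loc=0.5, size=(200, 16))    # shifted distribution
kid_same = kid_unbiased(real, same)
kid_shifted = kid_unbiased(real, shifted)
```

The score compares whole sets of samples, so a blurry regression output that matches each target on average can still score poorly. That is the behavior a per-frame distortion metric such as SSIM does not capture.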

Key insights

Diffusion models, evaluated with distribution metrics, offer superior perceptual realism and action controllability for AV world models.

Method

A latent Diffusion Transformer (DiT) predicts future scene latents from current camera latents and ego-actions, using residual anchoring and an x_{0}-prediction objective, then decodes to 256×256 frames.
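
A minimal sketch of how an x_{0}-prediction loss with residual anchoring could look, under several assumptions that do not come from the source: "residual anchoring" is read here as diffusing the difference between future and current latents and adding the current latent back afterwards; the cosine noise schedule is an arbitrary choice; and `zero_denoiser` is a hypothetical placeholder for the DiT network. The tensor shapes are illustrative.

```python
import numpy as np

def cosine_alpha_bar(t, s=0.008):
    """Cumulative signal level for t in [0, 1] (cosine schedule, assumed here)."""
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    f0 = np.cos(s / (1.0 + s) * np.pi / 2.0) ** 2
    return np.clip(f / f0, 0.0, 1.0)

def x0_residual_loss(denoiser, z_now, z_future, action, rng):
    """One loss evaluation: noise the residual, then predict the clean residual."""
    residual = z_future - z_now  # anchor the target on the current latent
    t = rng.uniform(0.0, 1.0, size=(z_now.shape[0], 1, 1))
    alpha_bar = cosine_alpha_bar(t)
    noise = rng.normal(size=residual.shape)
    noisy = np.sqrt(alpha_bar) * residual + np.sqrt(1.0 - alpha_bar) * noise
    # x_0 prediction: the network outputs the clean target, not the noise.
    pred_residual = denoiser(noisy, t, z_now, action)
    loss = float(np.mean((pred_residual - residual) ** 2))
    z_future_hat = z_now + pred_residual  # add the anchor back
    return loss, z_future_hat

def zero_denoiser(noisy, t, z_now, action):
    """Hypothetical stand-in for the DiT: always predicts 'no change'."""
    return np.zeros_like(noisy)

rng = np.random.default_rng(0)
z_now = rng.normal(size=(4, 16, 8))  # (batch, spatial tokens, latent dim)
z_future = z_now + 0.1 * rng.normal(size=(4, 16, 8))
action = rng.normal(size=(4, 2))     # e.g. steering and speed (illustrative)
loss, z_hat = x0_residual_loss(zero_denoiser, z_now, z_future, action, rng)
```

With the zero denoiser the prediction falls back to the current latent, so the loss equals the mean squared residual. That "copy the current frame" value is a natural baseline a trained network should beat.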

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.