OmniTraffic: A Controllable Generation Pipeline and Benchmark for Spatio-Temporal Traffic Reasoning
Summary
OmniTraffic is a controllable generation pipeline and benchmark for spatio-temporal traffic reasoning. It addresses a limitation of existing benchmarks, which focus on passive visual recognition. The pipeline reconstructs 12 real-world intersections into editable 3D environments and supplements them with surveillance footage from two countries, so models can be evaluated under both controlled and natural conditions. The benchmark defines a three-level task hierarchy: scene perception, multi-view and temporal reasoning, and decision support. From structured traffic metadata, OmniTraffic generates 8M VQA samples and includes a 3K human-verified test set covering vehicle states, lane functions, view-BEV correspondence, temporal dynamics, and signal-phase analysis. An evaluation of eleven frontier MLLMs revealed a substantial human-model gap, particularly on topology-grounded and spatio-temporal reasoning tasks. Fine-tuning a lightweight MLLM on simulated OmniTraffic data improved its performance on real-world traffic scenes, which demonstrates the value of simulation-generated supervision.
Key takeaway
For AI Scientists and Machine Learning Engineers developing MLLMs for autonomous driving or traffic management, this work highlights a critical gap: current models significantly underperform humans in spatio-temporal and topology-grounded traffic reasoning. Consider adding OmniTraffic to your evaluation pipelines to test model capabilities beyond basic recognition, and use its extensible pipeline and simulation-generated supervision to fine-tune models on these specific reasoning deficiencies before real-world deployment.
Key insights
OmniTraffic is a benchmark and pipeline for spatio-temporal traffic reasoning, revealing significant MLLM gaps in complex traffic understanding.
Principles
- Traffic reasoning requires structure-aware and spatio-temporal evaluation.
- Simulation-generated data enhances real-world MLLM performance.
- Existing MLLMs struggle with topology-grounded and temporal tasks.
Method
OmniTraffic reconstructs 12 real-world intersections into editable 3D environments. From the structured traffic metadata of these scenes, it generates multi-view VQA samples that cover diverse traffic scenarios and are organized into a three-level task hierarchy.
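To make the idea of generating VQA samples from structured metadata concrete, here is a minimal Python sketch. It is not the OmniTraffic pipeline: the `VehicleRecord` schema, its field names, the question templates, the level labels, and the 0.5 m/s threshold are all assumptions chosen for illustration, since the summary does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VehicleRecord:
    """Hypothetical per-frame metadata for one vehicle (field names are assumed)."""
    vehicle_id: str
    lane_function: str  # e.g. "left_turn" or "through"
    speed_mps: float    # speed in metres per second at this frame


@dataclass(frozen=True)
class VQASample:
    level: str      # task level; labels adapted from the summary, not official names
    question: str
    answer: str


def lane_function_sample(v: VehicleRecord) -> VQASample:
    """Scene perception: the answer is read directly from the metadata."""
    return VQASample(
        level="scene_perception",
        question=f"What is the function of the lane occupied by vehicle {v.vehicle_id}?",
        answer=v.lane_function,
    )


def motion_sample(before: VehicleRecord, after: VehicleRecord, eps: float = 0.5) -> VQASample:
    """Temporal reasoning: compare the same vehicle's speed at two frames."""
    if before.vehicle_id != after.vehicle_id:
        raise ValueError("both records must describe the same vehicle")
    delta = after.speed_mps - before.speed_mps
    if delta > eps:
        answer = "accelerating"
    elif delta < -eps:
        answer = "decelerating"
    else:
        answer = "steady"
    return VQASample(
        level="temporal_reasoning",
        question=f"Between the two frames, is vehicle {before.vehicle_id} "
                 "accelerating, decelerating, or steady?",
        answer=answer,
    )


# Two frames of the same (made-up) vehicle.
earlier = VehicleRecord("v1", "left_turn", 8.0)
later = VehicleRecord("v1", "left_turn", 3.0)
samples = [lane_function_sample(later), motion_sample(earlier, later)]
for s in samples:
    print(f"[{s.level}] {s.question} -> {s.answer}")
```

Because the answers are derived from the metadata rather than annotated by hand, a template approach like this scales to large sample counts, which is presumably why a small human-verified test set is kept alongside the generated data.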
In practice
- Use OmniTraffic to evaluate MLLMs on complex traffic scenarios.
- Fine-tune MLLMs with simulated traffic data for real-world gains.
- Configure intersections, camera views, and rare events for specific tests.
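As a rough illustration of the first point, the sketch below scores model answers separately for each task level, so that weaknesses in temporal or decision-support questions are not hidden by strong perception scores. The record format, the level labels, and the exact-match scoring rule are assumptions for this example, and the toy records are invented, not benchmark results.

```python
from collections import defaultdict


def accuracy_by_level(records):
    """Compute exact-match accuracy per task level.

    records: iterable of (level, predicted_answer, gold_answer) tuples.
    Matching ignores case and surrounding whitespace (an assumed scoring rule).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, predicted, gold in records:
        total[level] += 1
        if predicted.strip().lower() == gold.strip().lower():
            correct[level] += 1
    return {level: correct[level] / total[level] for level in total}


# Toy records for illustration only; these are NOT OmniTraffic results.
toy_records = [
    ("scene_perception", "left_turn", "left_turn"),
    ("scene_perception", "Through", "through"),
    ("temporal_reasoning", "steady", "decelerating"),
    ("temporal_reasoning", "accelerating", "accelerating"),
    ("decision_support", "extend green", "shorten green"),
]

scores = accuracy_by_level(toy_records)
for level, acc in scores.items():
    print(f"{level}: {acc:.2f}")
```

Running the same function over human answers to the same questions would give a per-level human-model gap by subtraction.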
Topics
- OmniTraffic
- Spatio-Temporal Reasoning
- Traffic Simulation
- Multimodal Large Language Models
- Autonomous Driving
- Computer Vision Benchmarking
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.