SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection
Summary
SToRe3D is a relevance-aligned sparsity framework that makes Vision Transformers (ViTs) more efficient for multi-view 3D object detection, particularly in autonomous driving. It targets the high inference latency caused by dense token and query processing across multiple views and large 3D regions. Unlike prior methods that focus on 2D vision or isolated modalities, SToRe3D jointly selects 2D image tokens and 3D object queries and stores the filtered features for later reactivation. Mutual 2D–3D relevance heads, supervised by a planner-inspired future interaction corridor, allocate compute to driving-critical content. Evaluated on nuScenes and the new nuScenes-Relevance benchmark, SToRe3D achieves up to 3x faster inference with marginal accuracy loss and reaches approximately 18 FPS with a ViT-B backbone, enabling real-time ViT-based 3D detection while maintaining accuracy on planning-critical agents.
Key takeaway
For research scientists building real-time 3D object detection systems for autonomous driving, SToRe3D offers a way past latency bottlenecks. Consider implementing joint 2D token and 3D query sparsity, guided by planning-aligned relevance, to gain substantial speedups (up to 3x in the reported evaluation) with only marginal accuracy loss. This brings ViT-based architectures to real-time performance, which matters for safety-critical applications.
Key insights
Jointly sparsifying 2D image tokens and 3D object queries with relevance-aligned filtering speeds up multi-view 3D detection by up to 3x with marginal accuracy loss.
Principles
- Prioritize compute on planning-critical agents.
- Joint 2D-3D sparsity is more effective than isolated pruning.
- Store-reactivate buffers mitigate information loss from aggressive pruning.
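The store-reactivate principle can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names `filter_and_store` and `reactivate`, the list-based token representation, and the fixed per-stage `keep` budgets are all assumptions made for illustration. Low-relevance tokens are moved to a buffer together with their original positions instead of being discarded, so they can be merged back later.

```python
from typing import List, Tuple

# (original position, feature vector); a hypothetical stand-in for a token or query
IndexedToken = Tuple[int, List[float]]


def filter_and_store(
    active: List[IndexedToken],
    scores: List[float],
    keep: int,
    buffer: List[IndexedToken],
) -> List[IndexedToken]:
    """Keep the `keep` highest-scoring tokens; append the rest to `buffer`."""
    if len(active) != len(scores):
        raise ValueError("one relevance score per active token is required")
    keep = max(0, min(keep, len(active)))
    ranked = sorted(range(len(active)), key=lambda i: scores[i], reverse=True)
    kept_positions = set(ranked[:keep])
    survivors: List[IndexedToken] = []
    for pos, item in enumerate(active):
        if pos in kept_positions:
            survivors.append(item)
        else:
            buffer.append(item)  # stored, not deleted
    return survivors


def reactivate(
    active: List[IndexedToken], buffer: List[IndexedToken]
) -> List[List[float]]:
    """Merge processed and stored tokens back into their original order."""
    merged = sorted(active + buffer, key=lambda pair: pair[0])
    return [token for _, token in merged]


# Two hierarchical filtering stages, then reactivation.
tokens = [(0, [0.0]), (1, [1.0]), (2, [2.0]), (3, [3.0])]
stored: List[IndexedToken] = []
stage1 = filter_and_store(tokens, [0.1, 0.9, 0.3, 0.8], keep=3, buffer=stored)
stage2 = filter_and_store(stage1, [0.9, 0.2, 0.8], keep=2, buffer=stored)
# Stand-in for the expensive layers that only run on surviving tokens.
processed = [(i, [10.0 * v for v in tok]) for i, tok in stage2]
restored = reactivate(processed, stored)
```

In this toy run only tokens 1 and 3 pass through the stand-in computation, yet `restored` still has one entry per original token, which is the property the buffer is meant to preserve.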
Method
SToRe3D uses mutual 2D–3D relevance heads, supervised by future interaction corridors, to hierarchically filter and store low-relevance tokens/queries, reactivating them at the final layer to preserve context.
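The summary does not define the future interaction corridor geometrically, so the following is only one hypothetical way to turn the idea into binary relevance labels. It assumes 2D waypoints for the ego vehicle's planned path and for each agent's future path, and marks an agent as relevant when any of its future positions comes within `half_width` meters of any ego waypoint. The function name `corridor_labels`, the distance rule, and the `half_width` parameter are illustrative assumptions, not details from the paper.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in meters, ego-centric frame (assumed)


def corridor_labels(
    ego_path: Sequence[Point],
    agent_paths: Sequence[Sequence[Point]],
    half_width: float,
) -> List[int]:
    """Return 1 for agents whose future path enters the corridor, else 0.

    Assumed rule: an agent is inside the corridor if any of its future
    positions lies within `half_width` of any ego waypoint.
    """
    labels: List[int] = []
    for path in agent_paths:
        inside = any(
            math.hypot(ax - ex, ay - ey) <= half_width
            for (ax, ay) in path
            for (ex, ey) in ego_path
        )
        labels.append(1 if inside else 0)
    return labels


ego = [(0.0, 0.0), (0.0, 5.0), (0.0, 10.0)]
agents = [
    [(1.0, 4.0), (1.0, 6.0)],    # drives alongside the ego path
    [(20.0, 0.0), (20.0, 5.0)],  # stays far to the side
]
labels = corridor_labels(ego, agents, half_width=2.0)
```

Labels of this kind could serve as training targets for a relevance head; how the paper actually constructs and weights them is not stated in this summary.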
In practice
- Implement Gumbel-softmax TopK for differentiable routing.
- Use a linear warm-up schedule for pruning budgets.
- Supervise relevance with a future interaction corridor.
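The first two tips above can be sketched in plain Python under stated assumptions. `keep_ratio` is a hypothetical linear warm-up that anneals the kept fraction from 1.0 (no pruning) to a target value; `gumbel_topk` shows only the forward pass of Gumbel-perturbed top-k selection, returning a hard 0/1 mask and a softmax relaxation. The gradient path that makes routing differentiable (for example a straight-through estimator) needs an autograd framework and is not shown. All names, defaults, and the schedule shape are assumptions rather than details taken from the paper.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple


def keep_ratio(step: int, warmup_steps: int, target_keep: float) -> float:
    """Linearly anneal the kept fraction from 1.0 down to `target_keep`."""
    if warmup_steps <= 0:
        return target_keep
    progress = min(max(step / warmup_steps, 0.0), 1.0)
    return 1.0 + progress * (target_keep - 1.0)


def gumbel_topk(
    logits: Sequence[float],
    k: int,
    tau: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[float]]:
    """Forward pass of Gumbel-perturbed top-k selection.

    Returns (hard_mask, soft_weights): a 0/1 mask with k ones and the
    softmax of the perturbed logits at temperature `tau`.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    if not logits:
        return [], []
    rng = rng or random.Random()
    perturbed = []
    for logit in logits:
        # Clamp the uniform sample so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        perturbed.append((logit + gumbel) / tau)
    peak = max(perturbed)  # subtract the max for numerical stability
    exps = [math.exp(p - peak) for p in perturbed]
    total = sum(exps)
    soft = [e / total for e in exps]
    k = max(0, min(k, len(logits)))
    order = sorted(range(len(logits)), key=lambda i: perturbed[i], reverse=True)
    top = set(order[:k])
    hard = [1 if i in top else 0 for i in range(len(logits))]
    return hard, soft


# Budget for the current training step, then a sampled selection.
relevance_logits = [100.0, -100.0, 100.0, -100.0]
ratio = keep_ratio(step=100, warmup_steps=100, target_keep=0.5)
budget = round(ratio * len(relevance_logits))
mask, weights = gumbel_topk(relevance_logits, budget, rng=random.Random(0))
```

Starting with a kept fraction of 1.0 lets the relevance heads train on dense features before pruning tightens. The Gumbel noise adds exploration when logits are close; with well-separated logits, as in the usage lines, the selection matches plain top-k.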
Topics
- Multi-View 3D Object Detection
- Vision Transformers
- Sparse Token Relevance
- Planning-Aligned Perception
- nuScenes-Relevance Benchmark
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.