SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

SToRe3D is a relevance-aligned sparsity framework that makes Vision Transformers (ViTs) more efficient for multi-view 3D object detection, particularly in autonomous driving. It targets the high inference latency caused by dense token and query processing across multiple views and large 3D regions. Unlike prior methods that focus on 2D vision or on isolated modalities, SToRe3D jointly selects 2D image tokens and 3D object queries and stores the filtered-out features so they can be reactivated later. Mutual 2D–3D relevance heads, supervised by a planner-inspired future interaction corridor, allocate compute to driving-critical content. Evaluated on nuScenes and the new nuScenes-Relevance benchmark, SToRe3D achieves up to 3× faster inference with marginal accuracy loss and maintains accuracy on planning-critical agents. With a ViT-B backbone it reaches approximately 18 FPS, enabling real-time ViT-based 3D detection.

Key takeaway

For research scientists building real-time 3D object detection systems for autonomous driving, SToRe3D offers a way to reduce the latency bottleneck of ViT-based detectors. Consider jointly sparsifying 2D tokens and 3D queries, guided by planning-aligned relevance, to obtain substantial speedups (up to 3× in the reported results) with only marginal accuracy loss while preserving accuracy on planning-critical agents. This brings ViT-based architectures into real-time range, which matters for safety-critical applications.

Key insights

Jointly sparsifying 2D image tokens and 3D object queries with relevance-aligned filtering significantly boosts multi-view 3D detection efficiency.

Principles

Method

SToRe3D uses mutual 2D–3D relevance heads, supervised by future interaction corridors, to hierarchically filter and store low-relevance tokens/queries, reactivating them at the final layer to preserve context.
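The filter, store, and reactivate pattern described above can be illustrated with a toy sketch. This is not the authors' implementation: tokens are plain scalars, the `layer` function stands in for a transformer block, `relevance_fn` is a hypothetical placeholder for the learned relevance heads, and the keep ratios are arbitrary. It covers a single token stream only; the same routine could, under the same assumptions, be applied to 3D queries.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def layer(token: int) -> int:
    """Placeholder for an expensive transformer block (assumption)."""
    return token + 10


def filter_stage(
    tokens: List[int],
    indices: List[int],
    scores: Sequence[float],
    keep_ratio: float,
    store: Dict[int, int],
) -> Tuple[List[int], List[int]]:
    """Keep the top-scoring fraction of tokens; move the rest into `store`."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:n_keep])  # preserve the original token order
    for i in order[n_keep:]:
        # Stored tokens keep their last computed value and skip later layers.
        store[indices[i]] = tokens[i]
    return [tokens[i] for i in keep], [indices[i] for i in keep]


def reactivate(
    tokens: List[int], indices: List[int], store: Dict[int, int], total: int
) -> List[int]:
    """Merge active and stored tokens back into their original positions."""
    merged = dict(store)
    merged.update(zip(indices, tokens))
    return [merged[i] for i in range(total)]


def run_sparse_stack(
    tokens: Sequence[int],
    relevance_fn: Callable[[int], float],
    keep_ratios: Sequence[float],
) -> Tuple[List[int], int]:
    """Filter hierarchically, one stage per ratio, then restore the full set."""
    store: Dict[int, int] = {}
    indices = list(range(len(tokens)))
    active = list(tokens)
    for ratio in keep_ratios:
        scores = [relevance_fn(t) for t in active]
        active, indices = filter_stage(active, indices, scores, ratio, store)
        active = [layer(t) for t in active]  # compute only on active tokens
    full = reactivate(active, indices, store, len(tokens))
    return full, len(active)


# Toy usage: relevance is simply the token value (hypothetical scoring).
full, n_active = run_sparse_stack([1, 9, 4, 8, 2, 7, 3, 6], float, [0.5, 0.5])
print(full, n_active)  # [1, 29, 4, 28, 2, 17, 3, 16] 2
```

In this toy run, only two of the eight tokens pass through both layers, yet the final output still contains all eight positions, which is the sense in which stored features preserve context at the end.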

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.