Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors
Summary
SEPatch3D is a framework for accelerating Vision Transformer (ViT)-based sparse multi-view 3D object detectors, which typically suffer from high inference latency because of the large number of tokens they process. Existing token compression methods, including pruning, merging, and patch size enlargement, often degrade 3D detection accuracy by discarding informative background cues, disrupting contextual consistency, and losing fine-grained semantics. SEPatch3D addresses these issues by dynamically adjusting patch sizes while preserving critical semantic information. It incorporates Spatiotemporal-aware Patch Size Selection (SPSS) to assign small patches for nearby objects and large patches for background-dominated scenes, which reduces computation. In addition, Informative Patch Selection (IPS) refines features from key patches, and Cross-Granularity Feature Enhancement (CGFE) enriches coarse patches with fine-grained details. Experiments on the nuScenes and Argoverse 2 validation sets show that SEPatch3D achieves up to 57% faster inference than StreamPETR and 20% higher efficiency than ToC3D-faster, while maintaining comparable detection accuracy.
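To make the cost of token processing concrete, the following minimal sketch counts patch tokens for two patch sizes. It relies only on the standard ViT property that non-overlapping patches of side P yield (H/P) x (W/P) tokens and that self-attention compares every pair of tokens. The 320 x 800 resolution and the helper name `num_tokens` are illustrative assumptions, not values taken from the original article.

```python
def num_tokens(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for one image."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


# Hypothetical input resolution, chosen only so both patch sizes divide evenly.
H, W = 320, 800

fine = num_tokens(H, W, 16)    # 20 * 50 = 1000 tokens
coarse = num_tokens(H, W, 32)  # 10 * 25 = 250 tokens

# Doubling the patch side divides the token count by 4 ...
token_ratio = fine / coarse
# ... and the number of token pairs seen by self-attention by 16.
attention_ratio = (fine ** 2) / (coarse ** 2)

print(fine, coarse, token_ratio, attention_ratio)
```

This is why enlarging patches is an attractive way to cut latency, and also why applying it uniformly risks losing the fine-grained semantics the summary mentions.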
Key takeaway
Research scientists developing or deploying ViT-based sparse multi-view 3D object detectors should investigate SEPatch3D as a way to reduce inference latency. Its dynamic patch sizing and semantic preservation techniques deliver up to 57% faster inference than StreamPETR and 20% higher efficiency than ToC3D-faster, while maintaining comparable detection accuracy on the nuScenes and Argoverse 2 validation sets. Consider integrating its principles to optimize your own models.
Key insights
Dynamic patch size adjustment and semantic preservation accelerate ViT-based 3D object detection while keeping detection accuracy comparable.
Principles
- Preserve critical semantic information
- Dynamically adjust patch sizes
- Refine informative patches
Method
SEPatch3D uses Spatiotemporal-aware Patch Size Selection (SPSS) for dynamic patch sizing, Informative Patch Selection (IPS) for feature refinement, and Cross-Granularity Feature Enhancement (CGFE) to inject fine-grained details into coarse patches.
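The three components can be illustrated with a toy sketch. This is not the authors' implementation: the depth threshold, the per-patch scores, the 2 x 2 average pooling, and all function names (`select_patch_size`, `top_k_patches`, `enhance_coarse`) are hypothetical stand-ins chosen to show the general shape of each step on scalar features.

```python
from typing import List, Sequence


def select_patch_size(nearest_depth_m: float, near_threshold_m: float = 20.0,
                      small: int = 16, large: int = 32) -> int:
    """Toy stand-in for SPSS: small patches when an object is close.

    `nearest_depth_m` is an assumed scalar cue, e.g. the distance to the
    nearest object estimated from earlier frames.
    """
    return small if nearest_depth_m < near_threshold_m else large


def top_k_patches(scores: Sequence[float], k: int) -> List[int]:
    """Toy stand-in for IPS: indices of the k highest-scoring patches."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # return in original spatial order


def enhance_coarse(coarse: List[List[float]],
                   fine: List[List[float]]) -> List[List[float]]:
    """Toy stand-in for CGFE: add pooled fine-grid values to coarse values.

    Assumes the fine grid has exactly twice the rows and columns of the
    coarse grid, so each coarse cell covers a 2 x 2 block of fine cells.
    """
    enhanced = []
    for r, row in enumerate(coarse):
        new_row = []
        for c, value in enumerate(row):
            block = [fine[2 * r + dr][2 * c + dc]
                     for dr in (0, 1) for dc in (0, 1)]
            new_row.append(value + sum(block) / 4.0)
        enhanced.append(new_row)
    return enhanced


# Usage with made-up inputs.
near_patch = select_patch_size(8.0)    # object close by
far_patch = select_patch_size(45.0)    # background-dominated view
kept = top_k_patches([0.1, 0.9, 0.3, 0.7], k=2)
mixed = enhance_coarse([[1.0]], [[1.0, 2.0], [3.0, 4.0]])
```

In a real detector each step would operate on feature tensors and learned scores rather than hand-set scalars. The sketch only shows how the three stages fit together: choose a granularity, pick the patches worth refining, and feed fine detail back into coarse tokens.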
In practice
- Accelerate ViT-based 3D detectors
- Improve inference efficiency
- Maintain detection accuracy
Topics
- ViT-based Detectors
- Token Compression
- 3D Object Detection
- SEPatch3D
- Inference Acceleration
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.