Robust Onion: Peeling Open Vocab Object Detectors Under Noise
Summary
The empirical study "Robust Onion" investigates how real-world noise affects Open Vocabulary Object Detectors (OV-ODs), a question that remains poorly understood because of these models' architectural complexity. Using controlled synthetic visual degradations, the study peels OV-ODs apart layer by layer and systematically analyzes feature collapse. It finds that models sharing similar vision backbones exhibit comparable robustness, driven primarily by similar feature collapse at corresponding layers. Pretraining strategy, architectural nuances, and caption supervision contribute minimally to robustness. The study also shows that robustness is governed predominantly by the image domain rather than the annotations, which explains why noise has a similar impact on COCO and LVIS and why ODinW-13, with its large, isolated objects, might suggest inflated robustness. Robust Onion validates these insights by improving real-world robustness on the BDD100K, WiderFace, and VisDRONE datasets with a lightweight, plug-and-play NN & TK0 approach that uses 96x fewer trainable parameters than end-to-end training.
Key takeaway
Machine Learning Engineers deploying Open Vocabulary Object Detectors in real-world, noisy environments should expect a model's robustness to depend primarily on its vision backbone and the image domain, not on pretraining strategy or caption supervision. Prioritize evaluating OV-ODs across diverse image domains rather than relying on annotation quality alone. Consider the lightweight NN & TK0 approach to improve robustness efficiently on datasets like BDD100K, WiderFace, or VisDRONE; it uses 96x fewer trainable parameters than end-to-end training.
Key insights
Open Vocabulary Object Detector robustness is primarily governed by vision backbones and image domain, with pretraining and annotations having minimal impact.
Principles
- Vision backbones dictate OV-OD robustness.
- Image domain governs robustness, not annotations.
- Feature collapse drives robustness degradation.
Method
Robust Onion employs controlled synthetic visual degradations to analyze OV-ODs layer-by-layer, systematically revealing feature collapse. A lightweight NN & TK0 approach improves real-world robustness.
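As a rough illustration of the layer-by-layer idea, the sketch below corrupts an image with additive Gaussian noise and compares clean and noisy activations after every layer of a toy network. Everything in it is an assumption made for illustration: the random linear-plus-ReLU stack stands in for a real vision backbone, Gaussian noise stands in for the study's degradations, and cosine similarity is just one plausible proxy for feature collapse, not necessarily the measure the study uses.

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng):
    """Return `image` (floats in [0, 1]) with additive Gaussian noise, clipped to [0, 1]."""
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def make_toy_backbone(in_dim, widths, rng):
    """Hypothetical stand-in for a vision backbone: random linear layers."""
    weights = []
    prev = in_dim
    for width in widths:
        weights.append(rng.normal(0.0, 1.0 / np.sqrt(prev), size=(prev, width)))
        prev = width
    return weights

def layer_features(weights, x):
    """Run `x` through the stack and keep the activation after every layer."""
    feats = []
    h = x
    for W in weights:
        h = np.maximum(h @ W, 0.0)  # linear layer followed by ReLU
        feats.append(h)
    return feats

def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def layerwise_similarity(weights, clean, noisy):
    """Clean-vs-noisy feature similarity at each layer (lower = more drift)."""
    clean_feats = layer_features(weights, clean.ravel())
    noisy_feats = layer_features(weights, noisy.ravel())
    return [cosine_similarity(c, n) for c, n in zip(clean_feats, noisy_feats)]

rng = np.random.default_rng(0)
image = rng.random((16, 16, 3))  # synthetic placeholder image
backbone = make_toy_backbone(image.size, [256, 128, 64], rng)

for sigma in (0.0, 0.1, 0.5):
    noisy = add_gaussian_noise(image, sigma, rng)
    sims = layerwise_similarity(backbone, image, noisy)
    print(f"sigma={sigma}: " + ", ".join(f"{s:.3f}" for s in sims))
```

With a real detector, the same comparison would be run on activations captured from the actual backbone layers, so that the layer where similarity drops can be matched across models.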
In practice
- Improve robustness on BDD100K, WiderFace, VisDRONE.
- Apply NN & TK0 for efficient robustness gains.
- Prioritize image domain in OV-OD evaluation.
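One way to act on the image-domain point is to report robustness per domain instead of as a single aggregate. The helper below is a minimal sketch under that assumption: the function names are hypothetical, and the scores (for example mAP on clean and degraded images) are whatever the reader has measured on their own data. It does not reproduce the study's evaluation protocol.

```python
def relative_retention(clean_score, degraded_score):
    """Fraction of clean performance kept under degradation (1.0 means no drop)."""
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return degraded_score / clean_score

def retention_by_domain(results):
    """Map each image domain to its retention.

    `results` maps a domain name to a (clean_score, degraded_score) pair
    measured by the reader.
    """
    return {
        domain: relative_retention(clean, degraded)
        for domain, (clean, degraded) in results.items()
    }

def least_robust_domain(results):
    """Return the domain that keeps the smallest share of its clean score."""
    retention = retention_by_domain(results)
    return min(retention, key=retention.get)
```

Comparing retention rather than raw scores keeps domains with very different clean baselines on the same footing.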
Topics
- Open Vocabulary Object Detection
- Model Robustness
- Noise Degradation
- Feature Collapse
- Computer Vision
- NN & TK0
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.