Technical Report for ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge: Leveraging DINOv3 for Robust Outdoor Scene Understanding in Field Robotics
Summary
The first-place solution for the ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge achieved a composite score of 76.57%, with 69.32% fine-class mIoU and 83.81% category-level mIoU. The challenge requires dense semantic segmentation of off-road imagery across 64 fine-grained classes and 11 coarse categories. The solution, aimed at robust outdoor scene understanding in field robotics, pairs a network-level design with an inference-time aggregation strategy. The network combines a self-supervised DINOv3 ViT-L/16 backbone, a ViT-Adapter, and a Mask2Former mask-classification decoder, and adds a coarse-category auxiliary loss on the global [CLS] token. At inference, it applies multi-scale and horizontal-flip test-time augmentation and ensembles the top three checkpoints, selected by their Codabench scores.
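The reported numbers are consistent with a composite score that is the unweighted mean of the two mIoU values: (69.32 + 83.81) / 2 = 76.565, which rounds to 76.57. The sketch below illustrates that arithmetic together with a plain mIoU computation from a confusion matrix. The averaging rule is an assumption inferred from the figures, not a formula stated in this summary, and the function names `miou` and `composite` are hypothetical.

```python
def miou(conf):
    """Mean IoU from a square confusion matrix (rows: ground truth, cols: prediction)."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                          # pixels of class c predicted as something else
        fp = sum(conf[r][c] for r in range(n)) - tp     # other pixels predicted as class c
        denom = tp + fp + fn
        if denom > 0:                                   # skip classes absent from both GT and prediction
            ious.append(tp / denom)
    return sum(ious) / len(ious)


def composite(fine_miou, category_miou):
    """Assumed scoring rule: unweighted mean of fine-class and category-level mIoU."""
    return (fine_miou + category_miou) / 2.0


toy_confusion = [[3, 1],
                 [1, 3]]
print(miou(toy_confusion))
print(composite(69.32, 83.81))
```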
Key takeaway
For computer vision engineers building semantic segmentation for field robotics, this first-place solution offers a concrete recipe: pair a self-supervised DINOv3 backbone with a Mask2Former decoder and add a coarse-category auxiliary loss. At inference, multi-scale and horizontal-flip test-time augmentation, combined with an ensemble of the best checkpoints, is the aggregation strategy behind the reported 69.32% fine-class mIoU on off-road imagery.
Key insights
Combining a DINOv3 backbone, a Mask2Former decoder, and test-time augmentation with checkpoint ensembling produced the top-ranked result (76.57% composite score) for fine-grained semantic segmentation of challenging outdoor scenes.
Principles
- Self-supervised backbones enhance segmentation.
- Multi-model ensembling improves robustness.
- Auxiliary losses guide fine-grained learning.
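One way to picture the auxiliary-loss principle is an image-level, multi-label objective: the [CLS] token predicts which of the 11 coarse categories appear in the image, and that term is added to the segmentation loss. The sketch below is a minimal NumPy illustration under several assumptions that this summary does not confirm: the binary cross-entropy formulation, the weight `aux_weight=0.4`, the ignore label 255, and the placeholder `FINE_TO_COARSE` table (the real GOOSE mapping is not given here). All names are hypothetical.

```python
import numpy as np

NUM_FINE, NUM_COARSE = 64, 11

# Placeholder lookup table (assumption): the real fine-to-coarse mapping is not given here.
FINE_TO_COARSE = np.arange(NUM_FINE) % NUM_COARSE


def coarse_presence_target(fine_labels, ignore_index=255):
    """Multi-hot vector marking which coarse categories occur in a fine label map."""
    valid = fine_labels[fine_labels != ignore_index]
    target = np.zeros(NUM_COARSE, dtype=np.float64)
    target[FINE_TO_COARSE[valid]] = 1.0
    return target


def bce_with_logits(logits, target):
    """Numerically stable binary cross-entropy, averaged over categories."""
    loss = np.maximum(logits, 0) - logits * target + np.log1p(np.exp(-np.abs(logits)))
    return float(loss.mean())


def total_loss(seg_loss, cls_logits, fine_labels, aux_weight=0.4):
    """Segmentation loss plus a weighted coarse-category term on the [CLS] logits."""
    target = coarse_presence_target(fine_labels)
    return seg_loss + aux_weight * bce_with_logits(cls_logits, target)


labels = np.array([[0, 12], [255, 0]])   # toy 2x2 fine label map with one ignored pixel
print(coarse_presence_target(labels))
print(total_loss(1.0, np.zeros(NUM_COARSE), labels))
```

In a real training loop, `seg_loss` would come from the mask-classification decoder and `cls_logits` from a small linear head on the backbone's [CLS] embedding.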
Method
The solution integrates a DINOv3 ViT-L/16 backbone, ViT-Adapter, and Mask2Former decoder with a coarse-category auxiliary loss. Inference uses multi-scale/horizontal-flip test-time augmentation and a three-checkpoint ensemble.
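The inference-time aggregation can be sketched as follows: each checkpoint is run on several rescaled copies of the image, with and without a horizontal flip, and the resulting class-probability maps are mapped back to the original resolution and averaged. This is a minimal NumPy sketch, not the authors' code. It assumes that probabilities are averaged uniformly, uses nearest-neighbour resizing for simplicity, and picks the scale set arbitrarily; `resize_nearest`, `predict_tta`, and the model callables are hypothetical names.

```python
import numpy as np


def resize_nearest(arr, out_h, out_w):
    """Nearest-neighbour resize over the last two axes of a (C, H, W) array."""
    h, w = arr.shape[-2:]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return arr[..., rows[:, None], cols[None, :]]


def predict_tta(models, image, scales=(0.75, 1.0, 1.25)):
    """Average class probabilities over models, scales, and horizontal flips.

    models: callables mapping a (3, h, w) image to (num_classes, h, w) probabilities.
    image:  array of shape (3, H, W).
    Returns an (H, W) array of predicted class indices.
    """
    _, h, w = image.shape
    total, count = None, 0
    for model in models:
        for s in scales:
            sh, sw = max(1, round(h * s)), max(1, round(w * s))
            scaled = resize_nearest(image, sh, sw)
            for flip in (False, True):
                inp = scaled[..., ::-1] if flip else scaled
                probs = model(inp)
                if flip:
                    probs = probs[..., ::-1]          # undo the flip on the output
                probs = resize_nearest(probs, h, w)   # back to the original resolution
                total = probs if total is None else total + probs
                count += 1
    return (total / count).argmax(axis=0)
```

To mirror the three-checkpoint ensemble described above, `models` would hold three callables, one per selected checkpoint. A production version would typically use bilinear interpolation for both images and probability maps.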
In practice
- Use DINOv3 with Mask2Former for segmentation.
- Apply multi-scale TTA for robust inference.
- Ensemble top checkpoints for higher scores.
Topics
- Semantic Segmentation
- DINOv3
- Mask2Former
- Field Robotics
- Test-Time Augmentation
- Model Ensembling
Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.