Technical Report for ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge: Leveraging DINOv3 for Robust Outdoor Scene Understanding in Field Robotics

2026-06-17 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

The first-place solution for the ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge achieved a 76.57% composite score, comprising 69.32% fine-class mIoU and 83.81% category-level mIoU. The challenge requires dense semantic segmentation of off-road imagery across 64 fine-grained classes and 11 coarse categories. The solution, designed for robust outdoor scene understanding in field robotics, pairs a network-level design with an inference-time aggregation strategy. The network combines a self-supervised DINOv3 ViT-L/16 backbone, a ViT-Adapter, and a Mask2Former mask-classification decoder, with a coarse-category auxiliary loss applied to the global [CLS] token. At inference, it uses multi-scale and horizontal-flip test-time augmentation, together with an ensemble of the top three checkpoints selected by their Codabench scores.
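The three reported numbers can be cross-checked with a short calculation. The sketch below assumes the composite score is the unweighted mean of the fine-class and category-level mIoU values; the summary does not state the formula, but the reported figures are consistent with that reading to within rounding.

```python
# Reported values from the summary (percent).
fine_miou = 69.32
category_miou = 83.81
reported_composite = 76.57

# Assumption: composite = unweighted mean of the two mIoU values.
composite = (fine_miou + category_miou) / 2

# The inputs are themselves rounded to two decimals, so compare with a tolerance.
gap = abs(composite - reported_composite)
print(f"mean of the two mIoU values: {composite:.3f}")
print(f"difference from reported composite: {gap:.3f}")
```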

Key takeaway

For computer vision engineers building semantic segmentation for field robotics, this first-place solution offers a concrete recipe: pair a self-supervised DINOv3 backbone with a Mask2Former decoder and add a coarse-category auxiliary loss. At inference, multi-scale and horizontal-flip test-time augmentation, combined with an ensemble of the top checkpoints, is the aggregation strategy behind the reported 76.57% composite score on fine-grained outdoor scenes.

Key insights

Combining a DINOv3 backbone, a Mask2Former decoder, and test-time augmentation with checkpoint ensembling produced the challenge's top result on fine-grained off-road segmentation: 69.32% fine-class mIoU and 83.81% category-level mIoU.

Principles

Self-supervised backbones enhance segmentation.
Checkpoint ensembling improves robustness.
Auxiliary losses guide fine-grained learning.

Method

The solution integrates a DINOv3 ViT-L/16 backbone, ViT-Adapter, and Mask2Former decoder with a coarse-category auxiliary loss. Inference uses multi-scale/horizontal-flip test-time augmentation and a three-checkpoint ensemble.
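The multi-scale and horizontal-flip test-time augmentation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `model` callable, the scale set (0.5, 1.0, 1.5), the nearest-neighbour resize (a stand-in for the bilinear interpolation a real pipeline would more likely use), and the `toy_model` are all hypothetical.

```python
import numpy as np

def resize_nearest(arr, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) array (stand-in for bilinear)."""
    _, h, w = arr.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return arr[:, rows[:, None], cols[None, :]]

def tta_predict(model, image, scales=(0.5, 1.0, 1.5)):
    """Average class probabilities over several scales and horizontal flips.

    model: hypothetical callable mapping a (3, H, W) image to (C, H, W)
           per-pixel class probabilities.
    """
    _, h, w = image.shape
    acc, n = None, 0
    for s in scales:
        sh, sw = max(1, round(h * s)), max(1, round(w * s))
        scaled = resize_nearest(image, sh, sw)
        for flip in (False, True):
            inp = scaled[:, :, ::-1] if flip else scaled
            probs = model(inp)
            if flip:
                probs = probs[:, :, ::-1]         # undo the flip on the output
            probs = resize_nearest(probs, h, w)   # back to the input resolution
            acc = probs if acc is None else acc + probs
            n += 1
    return acc / n

def toy_model(img):
    """Hypothetical 2-class 'segmenter': class 1 where the first channel > 0.5."""
    p = (img[0] > 0.5).astype(float)
    return np.stack([1.0 - p, p])

# Toy image: left half bright, right half dark.
image = np.zeros((3, 4, 6))
image[:, :, :3] = 1.0
avg_probs = tta_predict(toy_model, image)
labels = avg_probs.argmax(axis=0)
```

Averaging probabilities rather than hard labels lets a confident prediction at one scale outweigh uncertain ones at others; the final label map is the per-pixel argmax of the averaged probabilities.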

In practice

Use DINOv3 with Mask2Former for segmentation.
Apply multi-scale TTA for robust inference.
Ensemble top checkpoints for higher scores.
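The checkpoint-ensembling step can be sketched in a few lines. The summary says only that the top three checkpoints were selected by Codabench score; averaging their per-pixel class probabilities before the argmax is an assumption here, and the checkpoint names and scores below are illustrative placeholders, not reported results.

```python
import numpy as np

def select_top_k(scores, k=3):
    """Return the names of the k checkpoints with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def ensemble_labels(prob_maps):
    """Average (C, H, W) probability maps, then take the per-pixel argmax.

    Assumption: checkpoints are combined by a simple mean of probabilities.
    """
    mean_probs = np.mean(np.stack(prob_maps), axis=0)
    return mean_probs.argmax(axis=0)

# Placeholder leaderboard scores for hypothetical checkpoints.
scores = {"ckpt_a": 0.70, "ckpt_b": 0.75, "ckpt_c": 0.72, "ckpt_d": 0.68}
chosen = select_top_k(scores, k=3)

# Placeholder 2-class probability maps for a 1x2 image, one per chosen checkpoint.
prob_maps = [
    np.array([[[0.9, 0.4]], [[0.1, 0.6]]]),
    np.array([[[0.6, 0.3]], [[0.4, 0.7]]]),
    np.array([[[0.2, 0.6]], [[0.8, 0.4]]]),
]
final = ensemble_labels(prob_maps)
```

In the toy data the third checkpoint disagrees with the other two at both pixels, and the averaged probabilities follow the majority.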

Topics

Semantic Segmentation
DINOv3
Mask2Former
Field Robotics
Test-Time Augmentation
Model Ensembling

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.