GOOSE-M2F: Adapting Mask2Former for High-Fidelity, Long-Tailed Fine-Grained Semantic Segmentation in Unstructured Outdoor Terrain
Summary
GOOSE-M2F is a task-specific adaptation of Mask2Former designed for high-fidelity, long-tailed fine-grained semantic segmentation in unstructured outdoor terrain. Developed for the GOOSE 2D FGSS Challenge at ICRA 2026, the model targets a benchmark with 64 fine-grained classes, where rare classes often occupy fewer than 50 pixels per image. The system extends the Swin-Large Mask2Former baseline with three key contributions: 200 object queries to prevent representational saturation, a Feature Refinement Module (FRM) combining ASPP-lite and CBAM dual-attention, and an Auxiliary Supervision Head that supplies direct per-pixel gradients for rare classes. Its multi-stage training strategy incorporates Distribution-Balanced loss, Rare-Class Copy-Paste augmentation, dynamic IoU-aware re-weighting, and an exponential moving average (EMA) of the model weights. During inference, a dense sliding-window engine with 2D Gaussian kernel blending and 4-scale Test-Time Augmentation (TTA) improves mIoU by 10.57%. GOOSE-M2F achieved 70.08% Official Composite mIoU (63.55% fine, 76.61% coarse), securing 3rd place on the GOOSE 2D FGSS leaderboard.
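The sliding-window inference with 2D Gaussian blending mentioned in the summary can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the window size, stride, and Gaussian width are assumed values, and `predict_fn` is a hypothetical stand-in for the segmentation model.

```python
import numpy as np

def gaussian_window(size, sigma_frac=0.25):
    """2D Gaussian weight map of shape (size, size), peaking at the tile centre.
    sigma_frac is an assumed value, not one taken from the paper."""
    coords = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(coords ** 2) / (2.0 * (sigma_frac * size) ** 2))
    return np.outer(g, g)

def tile_starts(length, window, stride):
    """Start offsets that cover [0, length), with the last tile flush to the edge."""
    assert length >= window, "this sketch assumes the image is at least one window wide"
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts

def sliding_window_predict(image, predict_fn, num_classes, window=64, stride=32):
    """Blend per-tile class scores using Gaussian weights.

    image: (C, H, W) array.
    predict_fn: hypothetical callable mapping a (C, window, window) tile
                to (num_classes, window, window) scores.
    """
    h, w = image.shape[-2:]
    weights = gaussian_window(window)
    score_sum = np.zeros((num_classes, h, w))
    weight_sum = np.zeros((h, w))
    for y in tile_starts(h, window, stride):
        for x in tile_starts(w, window, stride):
            tile = image[:, y:y + window, x:x + window]
            scores = predict_fn(tile)
            # Pixels near a tile centre contribute more than pixels near its border.
            score_sum[:, y:y + window, x:x + window] += scores * weights
            weight_sum[y:y + window, x:x + window] += weights
    # Every pixel is covered by at least one tile, so weight_sum is strictly positive.
    return score_sum / weight_sum
```

Down-weighting tile borders is the usual motivation for this kind of blending: predictions near a crop edge see less context, so overlapping tiles are merged in favour of whichever tile has the pixel closer to its centre.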
Key takeaway
Computer vision engineers building semantic segmentation models for challenging outdoor or fine-grained datasets should consider adapting Mask2Former with targeted architectural and training enhancements. Increasing the number of object queries, adding a Feature Refinement Module, and attaching an Auxiliary Supervision Head can significantly boost performance on long-tailed distributions. At inference, dense sliding-window processing with 4-scale Test-Time Augmentation improved mIoU by a further 10.57% in the reported results.
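As a rough illustration of multi-scale Test-Time Augmentation, the sketch below averages class probabilities over rescaled copies of an image. The four scale factors are assumptions (the summary states only that four scales are used), nearest-neighbour resizing is chosen purely to keep the example dependency-free, and `predict_fn` is a hypothetical stand-in for the model.

```python
import numpy as np

def resize_nearest(arr, out_h, out_w):
    """Nearest-neighbour resize over the last two axes of arr."""
    in_h, in_w = arr.shape[-2:]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return arr[..., rows[:, None], cols[None, :]]

def multi_scale_tta(image, predict_fn, scales=(0.75, 1.0, 1.25, 1.5)):
    """Average class probabilities over rescaled copies of the image.

    image: (C, H, W) array.
    predict_fn: hypothetical callable mapping (C, h, w) to
                (num_classes, h, w) class probabilities.
    scales: assumed values; the source does not list the scales used.
    """
    _, h, w = image.shape
    total = None
    for s in scales:
        sh, sw = max(1, round(h * s)), max(1, round(w * s))
        probs = predict_fn(resize_nearest(image, sh, sw))
        probs = resize_nearest(probs, h, w)  # back to the original resolution
        total = probs if total is None else total + probs
    mean_probs = total / len(scales)
    return mean_probs.argmax(axis=0), mean_probs
```

In a real pipeline the resize would normally be bilinear, and each scaled copy could itself be processed tile by tile with a sliding window.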
Key insights
Adapting Mask2Former with specialized modules and training strategies improves fine-grained semantic segmentation on long-tailed outdoor datasets.
Principles
- Address representational saturation with increased object queries.
- Combine attention and spatial pooling for feature refinement.
- Provide direct supervision for rare classes.
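The third principle, direct supervision for rare classes, is commonly realised as an extra per-pixel cross-entropy term added to the main loss. The sketch below shows that general pattern in NumPy. The `aux_weight` value and the optional `class_weights` are assumptions, and the summary does not specify how the authors combine the two losses.

```python
import numpy as np

def pixel_cross_entropy(logits, target, class_weights=None):
    """Mean per-pixel cross-entropy.

    logits: (num_classes, H, W) array; target: (H, W) integer label map.
    class_weights: optional per-class weights, e.g. larger for rare classes.
    """
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    # Log-probability of the ground-truth class at every pixel.
    picked = np.take_along_axis(log_probs, target[None], axis=0)[0]
    if class_weights is None:
        return float(-picked.mean())
    w = np.asarray(class_weights)[target]
    return float(-(w * picked).sum() / w.sum())

def total_loss(main_loss, aux_logits, target, aux_weight=0.4, class_weights=None):
    """Main segmentation loss plus a weighted auxiliary per-pixel term.
    aux_weight=0.4 is an assumed value, not one reported in the source."""
    return main_loss + aux_weight * pixel_cross_entropy(aux_logits, target, class_weights)
```

Because the auxiliary term is computed per pixel, even a class covering only a few dozen pixels receives a gradient at each of those pixels.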
Method
Multi-stage training combines Distribution-Balanced loss, Rare-Class Copy-Paste augmentation, dynamic IoU-aware re-weighting, and EMA. Inference uses dense sliding-window prediction with 2D Gaussian blending and 4-scale TTA.
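EMA keeps a smoothed copy of the model weights that is updated after each training step and typically used for evaluation. A minimal sketch of the standard update rule follows. The decay value and the dictionary-of-parameters layout are assumptions made for illustration.

```python
def ema_update(ema_params, model_params, decay=0.999):
    """In-place EMA update: ema <- decay * ema + (1 - decay) * current.

    Both arguments are dicts mapping parameter names to floats or arrays.
    decay=0.999 is an assumed value, not one reported in the source.
    """
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params
```

A decay closer to 1 averages over a longer history of training steps, which gives a smoother copy that reacts more slowly to recent updates.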
In practice
- Use 200 object queries for complex, fine-grained tasks.
- Implement a Feature Refinement Module (ASPP-lite + CBAM).
- Apply Auxiliary Supervision Heads for rare class gradients.
Topics
- Semantic Segmentation
- Mask2Former
- Fine-Grained Segmentation
- Long-Tailed Data
- Outdoor Terrain
- Test-Time Augmentation
Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.