GOOSE-M2F: Adapting Mask2Former for High-Fidelity, Long-Tailed Fine-Grained Semantic Segmentation in Unstructured Outdoor Terrain

2026-06-14 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

GOOSE-M2F is a task-specific adaptation of Mask2Former for high-fidelity, long-tailed, fine-grained semantic segmentation in unstructured outdoor terrain. It was developed for the GOOSE 2D FGSS Challenge at ICRA 2026, a benchmark with 64 fine-grained classes in which rare classes often occupy fewer than 50 pixels per image. The system extends the Swin-Large Mask2Former baseline with three contributions: 200 object queries to prevent representational saturation, a Feature Refinement Module (FRM) that combines ASPP-lite with CBAM dual attention, and an Auxiliary Supervision Head that supplies direct per-pixel gradients for rare classes. Its multi-stage training strategy incorporates Distribution-Balanced loss, Rare-Class Copy-Paste augmentation, dynamic IoU-aware re-weighting, and an exponential moving average (EMA) of the weights. At inference, a dense sliding-window engine with 2D Gaussian kernel blending and 4-scale Test-Time Augmentation (TTA) improves mIoU by 10.57%. GOOSE-M2F achieved an official composite mIoU of 70.08% (63.55% fine, 76.61% coarse), placing 3rd on the GOOSE 2D FGSS leaderboard.
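
The three reported scores are consistent with the composite being the unweighted mean of the fine and coarse mIoU. That definition is an assumption inferred from the numbers, not something this summary states:

```latex
\[
\text{mIoU}_{\text{composite}}
  = \frac{\text{mIoU}_{\text{fine}} + \text{mIoU}_{\text{coarse}}}{2}
  = \frac{63.55 + 76.61}{2}
  = 70.08
\]
```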

Key takeaway

Computer vision engineers building semantic segmentation models for challenging outdoor or fine-grained datasets should consider adapting Mask2Former with targeted architectural and training changes. More object queries, a Feature Refinement Module, and an Auxiliary Supervision Head can significantly improve performance on long-tailed class distributions. At inference, dense sliding-window processing with 4-scale Test-Time Augmentation adds further gains, improving mIoU by over 10% in this work.

Key insights

Adapting Mask2Former with specialized modules and training strategies improves fine-grained semantic segmentation on long-tailed outdoor datasets.

Principles

Address representational saturation with increased object queries.
Combine attention and spatial pooling for feature refinement.
Provide direct supervision for rare classes.
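
To illustrate why a per-pixel auxiliary loss gives rare classes a direct gradient, here is a minimal NumPy sketch of a class-weighted cross-entropy and its gradient with respect to the logits. It is not the GOOSE-M2F implementation: the function name `aux_pixel_loss`, the `class_weights` argument, and the idea of folding the re-weighting into this loss are assumptions made for illustration.

```python
import numpy as np

def aux_pixel_loss(logits, target, class_weights):
    """Class-weighted per-pixel cross-entropy (hypothetical helper).

    logits:        (C, H, W) raw scores from an auxiliary head
    target:        (H, W) integer class labels
    class_weights: (C,) per-class weights (assumed; e.g. larger for rare classes)
    Returns the scalar loss and its gradient w.r.t. the logits.
    """
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=0, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)

    # One-hot targets with shape (C, H, W).
    onehot = np.eye(logits.shape[0])[target].transpose(2, 0, 1)

    # Each pixel is weighted by the weight of its ground-truth class.
    w = class_weights[target]
    pixel_loss = -np.log((probs * onehot).sum(axis=0) + 1e-12)
    loss = (w * pixel_loss).sum() / w.sum()

    # Every labeled pixel contributes (softmax - onehot), scaled by its weight,
    # so even a class covering a handful of pixels receives its own gradient.
    grad = w[None] * (probs - onehot) / w.sum()
    return loss, grad

# Usage: a 4x4 map where class 2 covers a single pixel and is up-weighted.
logits = np.zeros((3, 4, 4))
target = np.zeros((4, 4), dtype=int)
target[0, 0] = 2
loss, grad = aux_pixel_loss(logits, target, np.array([1.0, 1.0, 5.0]))
```

In a full training loop this term would typically be added to the main mask-classification loss with a small coefficient; that coefficient is not given in this summary.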

Method

Multi-stage training combines Distribution-Balanced loss, Rare-Class Copy-Paste augmentation, dynamic IoU-aware re-weighting, and EMA. Inference uses a dense sliding window with 2D Gaussian kernel blending and 4-scale TTA.
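
The inference step can be sketched as follows, assuming a generic `predict_fn` that maps an image tile to per-class probabilities. The window size, stride, Gaussian width (`sigma_frac`), and all function names here are illustrative assumptions, not values from the GOOSE-M2F code.

```python
import numpy as np

def gaussian_kernel_2d(size, sigma_frac=0.25):
    """Separable 2D Gaussian that peaks at the tile center (sigma is assumed)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-0.5 * (ax / (sigma_frac * size)) ** 2)
    kernel = np.outer(g, g)
    return kernel / kernel.max()

def window_starts(length, window, stride):
    """Start offsets that cover [0, length), with the last window flush to the edge."""
    starts = list(range(0, length - window, stride))
    starts.append(length - window)
    return starts

def sliding_window_predict(image, predict_fn, num_classes, window=64, stride=32):
    """Blend overlapping tile predictions with Gaussian weights.

    image:      (H, W, ...) array with H >= window and W >= window
    predict_fn: callable returning (num_classes, window, window) probabilities
    """
    h, w = image.shape[:2]
    assert h >= window and w >= window, "image must be at least one window in size"
    kernel = gaussian_kernel_2d(window)
    score_sum = np.zeros((num_classes, h, w))
    weight_sum = np.zeros((h, w))
    for y in window_starts(h, window, stride):
        for x in window_starts(w, window, stride):
            probs = predict_fn(image[y:y + window, x:x + window])
            # Tile centers count more than tile borders, which softens seams.
            score_sum[:, y:y + window, x:x + window] += probs * kernel
            weight_sum[y:y + window, x:x + window] += kernel
    # The kernel is strictly positive, so every covered pixel has weight > 0.
    return score_sum / weight_sum

# Usage with a stand-in model that always predicts class 1.
def dummy_predict(tile):
    probs = np.zeros((3, tile.shape[0], tile.shape[1]))
    probs[1] = 1.0
    return probs

blended = sliding_window_predict(np.zeros((100, 80, 3)), dummy_predict, num_classes=3)
labels = blended.argmax(axis=0)
```

Multi-scale TTA would plausibly run this routine on several resized copies of the image and average the resulting probability maps at the original resolution; the four scales used by GOOSE-M2F are not listed in this summary, so they are left out of the sketch.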

In practice

Increase the object query count (200 in this work) for complex, fine-grained tasks.
Implement a Feature Refinement Module (ASPP-lite + CBAM).
Add an Auxiliary Supervision Head to give rare classes direct per-pixel gradients.
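
For the CBAM half of the Feature Refinement Module, the following NumPy forward pass sketches the standard CBAM ordering (channel attention, then spatial attention with a 7x7 convolution). The weights are placeholders, the ASPP-lite branch is not shown, and how GOOSE-M2F wires the two together is not described in this summary, so treat this as an illustration of the building block only.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: (C, H, W); w1: (C // r, C) and w2: (C, C // r) form a shared MLP."""
    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)  # linear -> ReLU -> linear
    avg_desc = x.mean(axis=(1, 2))  # global average pooling per channel
    max_desc = x.max(axis=(1, 2))   # global max pooling per channel
    gate = sigmoid(mlp(avg_desc) + mlp(max_desc))
    return x * gate[:, None, None]

def spatial_attention(x, kernel):
    """x: (C, H, W); kernel: (2, k, k) convolution weights with odd k."""
    k = kernel.shape[-1]
    pad = k // 2
    # Pool across channels to get a 2-channel (average, max) spatial map.
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    # Windows have shape (2, H, W, k, k); contracting with the kernel is a
    # "same"-padded cross-correlation, as in common deep learning frameworks.
    windows = sliding_window_view(padded, (k, k), axis=(1, 2))
    gate = sigmoid(np.einsum("chwij,cij->hw", windows, kernel))
    return x * gate[None]

def cbam(x, w1, w2, kernel):
    """Channel attention followed by spatial attention."""
    return spatial_attention(channel_attention(x, w1, w2), kernel)

# Usage with random placeholder weights (8 channels, reduction ratio 4).
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 5, 6))
refined = cbam(
    features,
    w1=rng.normal(size=(2, 8)),
    w2=rng.normal(size=(8, 2)),
    kernel=rng.normal(size=(2, 7, 7)),
)
```

Both gates lie in (0, 1), so the block rescales features without changing their shape, which is what lets it drop into an existing decoder.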

Topics

Semantic Segmentation
Mask2Former
Fine-Grained Segmentation
Long-Tailed Data
Outdoor Terrain
Test-Time Augmentation

Code references

Aditya-Lingam-9000/GOOSE-M2F

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.