GOOSE-M2F: Adapting Mask2Former for High-Fidelity, Long-Tailed Fine-Grained Semantic Segmentation in Unstructured Outdoor Terrain

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

GOOSE-M2F is a task-specific adaptation of Mask2Former for high-fidelity, long-tailed, fine-grained semantic segmentation in unstructured outdoor terrain. Developed for the GOOSE 2D FGSS Challenge at ICRA 2026, it targets a benchmark with 64 fine-grained classes in which rare classes often occupy fewer than 50 pixels per image. The system extends the Swin-Large Mask2Former baseline with three contributions: 200 object queries to prevent representational saturation, a Feature Refinement Module (FRM) that combines ASPP-lite with CBAM dual attention, and an Auxiliary Supervision Head that supplies direct per-pixel gradients for rare classes. The multi-stage training strategy combines Distribution-Balanced loss, Rare-Class Copy-Paste augmentation, dynamic IoU-aware re-weighting, and EMA. At inference, a dense sliding-window engine with 2D Gaussian kernel blending and 4-scale Test-Time Augmentation (TTA) raises mIoU by a reported 10.57%. GOOSE-M2F achieved 70.08% Official Composite mIoU (63.55% fine, 76.61% coarse, which matches the mean of the two), placing 3rd on the GOOSE 2D FGSS leaderboard.
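To make the sliding-window step concrete, the following is a minimal sketch of Gaussian-weighted blending of overlapping window predictions. It is not the authors' implementation. The function names, the `sigma_frac` value, the window and stride sizes, and the single-channel score map are all assumptions chosen for illustration, and it is written in plain Python for readability where a real pipeline would use tensor operations over all 64 class channels.

```python
import math


def gaussian_kernel_2d(size, sigma_frac=0.25):
    """Separable 2D Gaussian weight map that peaks at the window centre.

    sigma_frac is a hypothetical choice; the source does not state the kernel width.
    """
    sigma = size * sigma_frac
    centre = (size - 1) / 2.0
    g = [math.exp(-((i - centre) ** 2) / (2 * sigma ** 2)) for i in range(size)]
    return [[gy * gx for gx in g] for gy in g]


def window_starts(length, window, stride):
    """Start offsets that cover [0, length); the last window sits flush with the edge."""
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts


def blend_sliding_window(image, predict, window, stride):
    """Accumulate Gaussian-weighted window predictions, then normalise per pixel.

    image:   2D list (H x W) of floats, with H and W at least `window`.
    predict: callable mapping a window x window crop to a same-shape score map
             (a stand-in for the segmentation model).
    """
    h, w = len(image), len(image[0])
    if h < window or w < window:
        raise ValueError("image must be at least one window in each dimension")
    kernel = gaussian_kernel_2d(window)
    score_sum = [[0.0] * w for _ in range(h)]
    weight_sum = [[0.0] * w for _ in range(h)]
    for y0 in window_starts(h, window, stride):
        for x0 in window_starts(w, window, stride):
            crop = [row[x0:x0 + window] for row in image[y0:y0 + window]]
            pred = predict(crop)
            for dy in range(window):
                for dx in range(window):
                    k = kernel[dy][dx]
                    score_sum[y0 + dy][x0 + dx] += k * pred[dy][dx]
                    weight_sum[y0 + dy][x0 + dx] += k
    # Dividing by the accumulated weights turns the sums into a weighted average.
    return [[score_sum[y][x] / weight_sum[y][x] for x in range(w)] for y in range(h)]
```

Weighting each window by a centre-peaked kernel gives less influence to predictions near window borders, where the model sees the least context, so seams between adjacent windows are smoothed rather than hard-cut.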

Key takeaway

If you are a computer vision engineer building semantic segmentation models for challenging outdoor or fine-grained datasets, consider adapting Mask2Former with targeted architectural and training changes. Increasing the number of object queries and adding a Feature Refinement Module and an Auxiliary Supervision Head can improve performance on long-tailed class distributions. At inference, dense sliding-window processing with 4-scale Test-Time Augmentation adds a further gain, reported at more than 10% mIoU.
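As an illustration of the multi-scale part of that inference advice, here is a minimal sketch of scale-based Test-Time Augmentation: the image is resized to several scales, each version is passed through the model, and the predictions are resized back and averaged. The scale values, the nearest-neighbour resize, and the function names are assumptions for illustration only; the summary states that four scales are used but not which ones, and a production pipeline would typically use bilinear interpolation on tensors.

```python
def resize_nearest(grid, out_h, out_w):
    """Nearest-neighbour resize of a 2D list to out_h x out_w."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def multi_scale_tta(image, predict, scales=(0.5, 1.0, 1.5, 2.0)):
    """Average predictions made at several input scales.

    image:   2D list (H x W) of floats.
    predict: callable mapping a 2D grid to a same-shape score map
             (a stand-in for the segmentation model).
    scales:  hypothetical placeholder values, not taken from the source.
    """
    h, w = len(image), len(image[0])
    total = [[0.0] * w for _ in range(h)]
    for s in scales:
        sh, sw = max(1, round(h * s)), max(1, round(w * s))
        pred = predict(resize_nearest(image, sh, sw))
        # Bring every prediction back to the original resolution before averaging.
        restored = resize_nearest(pred, h, w)
        for y in range(h):
            for x in range(w):
                total[y][x] += restored[y][x]
    n = len(scales)
    return [[v / n for v in row] for row in total]
```

Averaging across scales lets small, rare-class regions benefit from the upscaled passes while large, homogeneous regions benefit from the wider context of the downscaled ones.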

Key insights

Adapting Mask2Former with specialized modules and training strategies improves fine-grained semantic segmentation on long-tailed outdoor datasets.

Method

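Multi-stage training combines Distribution-Balanced loss, Rare-Class Copy-Paste augmentation, dynamic IoU-aware re-weighting, and an exponential moving average (EMA) of the model weights. Inference uses dense sliding-window prediction with Gaussian blending and 4-scale TTA.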
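Of the training components listed, the EMA step is the simplest to show in isolation. The sketch below keeps a smoothed "shadow" copy of named parameters and updates it after each optimizer step. The class name, the dictionary-of-floats representation, and the decay value of 0.999 are assumptions for illustration; the summary does not give the decay used, and a real implementation would operate on framework tensors.

```python
class WeightEMA:
    """Exponential moving average over a dict of named scalar parameters."""

    def __init__(self, weights, decay=0.999):
        # decay is a hypothetical default; values close to 1 smooth more strongly.
        self.decay = decay
        self.shadow = dict(weights)  # start the average at the initial weights

    def update(self, weights):
        """shadow <- decay * shadow + (1 - decay) * current weights."""
        d = self.decay
        for name, value in weights.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value
        return self.shadow


# Usage: call update() after every training step, then evaluate with ema.shadow.
ema = WeightEMA({"w": 1.0, "b": 0.0}, decay=0.9)
ema.update({"w": 2.0, "b": 1.0})
```

The averaged weights change more slowly than the raw weights, which tends to damp step-to-step noise in the evaluated model.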

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.