Robust Multispectral Semantic Segmentation under Missing or Full Modalities via Structured Latent Projection

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Environmental Science & Earth Systems · Depth: Expert, extended

Summary

Irem Ulku, Erdem Akagündüz, and Ömer Özgür Tanrıöver introduce CBC-SLP, a multimodal semantic segmentation model for remote sensing data that is designed to stay robust whether all modalities are present or some are missing. Existing models that tolerate missing modalities often pay for it with lower accuracy when every modality is available, because they over-rely on shared representations. CBC-SLP addresses this with a structured latent projection, used as an architectural inductive bias, that explicitly separates latent representations into shared and modality-specific components. These components are passed to the decoder adaptively, according to the availability mask, so that complementary information is preserved. In experiments on three multimodal remote sensing datasets (DSTL, Potsdam, Hunan), CBC-SLP consistently outperforms other multimodal models across the tested modality availability scenarios, recovering complementary information that purely shared representations do not retain.

Key takeaway

Computer vision engineers building multimodal semantic segmentation models for remote sensing, where the model must work reliably with both complete and incomplete sensor data, should consider an architectural design like CBC-SLP. Explicitly separating shared and modality-specific latent features can improve robustness under missing modalities without sacrificing accuracy when all data streams are available, which makes the approach more versatile than models that rely solely on shared representations.

Key insights

Structured latent projection in CBC-SLP preserves modality-specific information, enhancing multimodal semantic segmentation robustness.

Principles

Perfect modality alignment can reduce downstream task performance.
Retaining modality-specific components is beneficial for model performance.
Architectural inductive bias can replace explicit loss-based regularization.

Method

CBC-SLP uses ResNet-based 3D convolutional encoders, cross-modal fusion, and intra-modal self-attention. It then models inter-modal correlations and projects latent representations into shared and modality-specific components, routing them to the decoder via an availability mask.
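
The last step of the method, splitting latents into shared and modality-specific parts and routing them by availability, can be illustrated with a minimal sketch. This is not the authors' implementation: the plain linear maps, the mean over shared components, the zero-filling of missing private components, and every name below (`project`, `route_to_decoder`, `w_shared`, `w_private`) are assumptions chosen only to make the idea concrete.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Sequence[float]) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def project(
    latents: Dict[str, Vector],
    w_shared: Matrix,
    w_private: Dict[str, Matrix],
) -> Tuple[Dict[str, Vector], Dict[str, Vector]]:
    """Split each modality latent into a shared and a private component.

    Assumption: one shared linear map for all modalities and one private
    linear map per modality. The real model may use a different projection.
    """
    shared = {m: matvec(w_shared, z) for m, z in latents.items()}
    private = {m: matvec(w_private[m], z) for m, z in latents.items()}
    return shared, private


def route_to_decoder(
    shared: Dict[str, Vector],
    private: Dict[str, Vector],
    mask: Dict[str, bool],
    order: Sequence[str],
) -> Vector:
    """Build the decoder input from the components of available modalities."""
    available = [m for m in order if mask[m]]
    if not available:
        raise ValueError("at least one modality must be available")
    dim = len(shared[available[0]])
    # Shared part: average over available modalities only (assumed fusion rule).
    fused = [sum(shared[m][i] for m in available) / len(available) for i in range(dim)]
    # Private part: keep it for available modalities, zero it for missing ones.
    for m in order:
        fused.extend(private[m] if mask[m] else [0.0] * len(private[m]))
    return fused


latents = {"msi": [1.0, 2.0], "dem": [3.0, 4.0]}
w_shared = [[1.0, 0.0]]
w_private = {"msi": [[0.0, 1.0]], "dem": [[0.0, 1.0]]}
shared, private = project(latents, w_shared, w_private)
full = route_to_decoder(shared, private, {"msi": True, "dem": True}, ["msi", "dem"])
no_dem = route_to_decoder(shared, private, {"msi": True, "dem": False}, ["msi", "dem"])
```

In this toy setup the decoder input keeps a fixed layout regardless of which modalities are present, so the private slot of an available modality can still carry information that the averaged shared slot would lose.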

In practice

Use multispectral, SAR, and DEM data for land-cover mapping.
Apply availability gating at both the encoder and latent levels to handle missing modalities.
Decompose latent representations into shared and private components.
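
As a rough illustration of encoder-level gating and of checking a model across availability scenarios, the sketch below enumerates every mask with at least one modality present and fuses only the features of present modalities. The modality keys (`msi`, `sar`, `dem`), the function names, and the masked-mean fusion rule are assumptions made for illustration; they stand in for, and do not reproduce, the cross-modal fusion used in CBC-SLP.

```python
from itertools import product
from typing import Dict, Iterator, List, Sequence


def availability_masks(modalities: Sequence[str]) -> Iterator[Dict[str, bool]]:
    """Yield every availability mask in which at least one modality is present."""
    for bits in product([True, False], repeat=len(modalities)):
        if any(bits):
            yield dict(zip(modalities, bits))


def gated_fusion(features: Dict[str, List[float]], mask: Dict[str, bool]) -> List[float]:
    """Fuse encoder features with a masked mean.

    Features of missing modalities are dropped before fusion, and the mean is
    taken over present modalities only, so the fused scale does not shrink
    as modalities go missing.
    """
    present = [f for m, f in features.items() if mask[m]]
    if not present:
        raise ValueError("at least one modality must be available")
    return [sum(vals) / len(present) for vals in zip(*present)]


masks = list(availability_masks(["msi", "sar", "dem"]))
features = {"msi": [2.0, 4.0], "sar": [4.0, 8.0], "dem": [100.0, 100.0]}
fused = gated_fusion(features, {"msi": True, "sar": True, "dem": False})
```

Looping over `masks` at evaluation time is one simple way to report a score for each combination of available sensors rather than only for the full-modality case.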

Topics

Multispectral Semantic Segmentation
Missing Modality Robustness
Structured Latent Projection
Remote Sensing Imagery
Modality Fusion

Code references

iremulku/Multispectral-Semantic-Segmentation-via-Structured-Latent-Projection-CBC-SLP-

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.