HiSem: Hierarchical Semantic Disentangling for Remote Sensing Image Change Captioning

2026-05-14 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

HiSem, a novel hierarchical semantic disentangling network, addresses limitations in Remote Sensing Image Change Captioning (RSICC) by explicitly disentangling semantic representations of varying granularities. Existing RSICC methods often process changed and unchanged image pairs, which possess different semantic granularities, under a unified modeling strategy, leading to semantic entanglement. HiSem introduces the Bidirectional Differential Attention Modulation (BDAM) module to enhance cross-temporal interactions and amplify true change signals. Furthermore, it incorporates a Hierarchical Adaptive Semantic Disentanglement (HASD) module, which employs a coarse-grained image-level routing mechanism to distinguish changed and unchanged pairs, and a fine-grained token-level Mixture-of-Experts (MoE) block to model diverse change semantics. Experiments on two benchmark datasets show HiSem outperforms previous methods, achieving a +7.52% BLEU-4 improvement on the WHU-CDC dataset.

Key takeaway

For research scientists developing RSICC models, you should consider adopting hierarchical semantic disentangling strategies. Explicitly separating coarse-grained change judgment from fine-grained semantic understanding, as demonstrated by HiSem's +7.52% BLEU-4 improvement on WHU-CDC, can significantly enhance model accuracy and provide a more structured approach to bi-temporal scene analysis.

Key insights

Disentangling semantic granularities hierarchically improves remote sensing image change captioning performance.

Principles

Align model design with intrinsic semantic heterogeneity.
Amplify true change signals via discrepancy-aware attention.

Method

HiSem uses Bidirectional Differential Attention Modulation (BDAM) for cross-temporal interaction, followed by Hierarchical Adaptive Semantic Disentanglement (HASD) with image-level routing and a token-level Mixture-of-Experts (MoE) for semantic modeling.

In practice

Use BDAM for enhancing change detection.
Implement MoE for diverse semantic modeling.

Topics

Remote Sensing Image Change Captioning
Hierarchical Semantic Disentangling
Bidirectional Differential Attention Modulation
Hierarchical Adaptive Semantic Disentanglement
Mixture-of-Experts

Code references

Man-Wang-star/HiSem

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.