ConvVitMamba: Efficient Multiscale Convolution, Transformer, and Mamba-Based Sequence modelling for Hyperspectral Image Classification

2026-04-22 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision, Environmental Science & Earth Systems · Depth: Expert, extended

Summary

ConvVitMamba is a hybrid deep learning framework for efficient hyperspectral image (HSI) classification. It targets challenges such as high spectral dimensionality, redundancy, and limited labeled data. The architecture unifies three components: a multiscale convolutional feature extractor for local spectral, spatial, and spectral-spatial patterns; a Vision Transformer (ViT) for global contextual relationships; and a lightweight Mamba-inspired gated sequence mixing module for efficient, content-aware sequence refinement. Principal Component Analysis (PCA) reduces the spectral dimensionality before feature extraction. Evaluated on four benchmark HSI datasets (Houston, QUH-Pingan, QUH-Qingyun, and QUH-Tangdaowan), ConvVitMamba consistently outperforms existing CNN-, Transformer-, and Mamba-based methods in Overall Accuracy, Average Accuracy, and Kappa coefficient. It does so with a compact model of 384,010 parameters and a moderate computational cost of 22.7M FLOPs, a favorable balance between accuracy and inference efficiency.
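Overall Accuracy, Average Accuracy, and the Kappa coefficient are all derived from the same counts of true and predicted labels. The sketch below shows one way to compute them; the function name `hsi_metrics` and the toy labels are illustrative assumptions, not taken from the paper or its repository.

```python
from collections import Counter

def hsi_metrics(y_true, y_pred):
    """Return (overall accuracy, average accuracy, Cohen's kappa)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    n = len(y_true)
    classes = sorted(set(y_true))
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    # Overall Accuracy: fraction of all samples classified correctly.
    oa = sum(correct.values()) / n
    # Average Accuracy: mean per-class recall, so rare classes weigh equally.
    aa = sum(correct[c] / true_counts[c] for c in classes) / len(classes)
    # Kappa: observed agreement corrected for chance agreement p_e.
    labels = set(true_counts) | set(pred_counts)
    p_e = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    kappa = (oa - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return oa, aa, kappa

# Toy example with three classes (hypothetical labels, not benchmark data).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
oa, aa, kappa = hsi_metrics(y_true, y_pred)
print(f"OA={oa:.3f} AA={aa:.3f} Kappa={kappa:.3f}")
```

Reporting all three matters for HSI benchmarks because class sizes are often very uneven: a model can reach a high Overall Accuracy while performing poorly on small classes, which Average Accuracy and Kappa expose.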

Key takeaway

For computer vision engineers building hyperspectral image classification systems, ConvVitMamba combines multiscale CNNs, a Vision Transformer, and Mamba-inspired sequence mixing in one compact model (384,010 parameters, 22.7M FLOPs). On the four benchmarks reported, it is more accurate than the CNN-, Transformer-, and Mamba-based methods it was compared against, with faster inference than many of them. Consider it when classification accuracy must be balanced against computational cost and labeled data is limited.

Key insights

ConvVitMamba efficiently classifies hyperspectral images by combining multiscale CNNs, Vision Transformers, and Mamba-inspired sequence mixing.

Principles

Hybrid architectures improve HSI classification.
Multiscale feature extraction enhances feature diversity across spectral, spatial, and spectral-spatial patterns.
Content-aware sequence mixing refines representations.

Method

The ConvVitMamba method preprocesses HSI data with PCA, then extracts features using parallel 3D convolutional branches, processes tokens with a Vision Transformer, and refines them via a Mamba-inspired gated sequence mixing module before classification.
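The last stage of the pipeline, gated sequence mixing, can be illustrated with a small NumPy sketch. This summary does not specify the module's internals, so everything here is an assumption made for illustration: a causal depthwise convolution along the token axis, modulated by a sigmoid gate computed from the same tokens, plus a residual connection. The function `gated_sequence_mix` and all weight shapes are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_sequence_mix(tokens, w_value, w_gate, kernel):
    """Illustrative gated mixing over a token sequence (assumed design).

    tokens:  (T, D) sequence of T tokens with D channels
    w_value: (D, D) projection producing the values to be mixed
    w_gate:  (D, D) projection producing the content-aware gate
    kernel:  (K, D) depthwise taps; kernel[k] weights the token k steps back
    """
    T, D = tokens.shape
    K = kernel.shape[0]
    values = tokens @ w_value            # (T, D)
    gate = sigmoid(tokens @ w_gate)      # (T, D), depends on token content
    # Causal depthwise convolution: pad K-1 zero rows in front so that
    # position t only sees positions t, t-1, ..., t-K+1.
    padded = np.vstack([np.zeros((K - 1, D)), values])
    mixed = np.zeros_like(values)
    for k in range(K):
        start = K - 1 - k
        mixed += kernel[k] * padded[start:start + T]
    # The gate decides, per token and channel, how much mixed context to add.
    return tokens + gate * mixed

# Random stand-ins for ViT output tokens and learned weights.
rng = np.random.default_rng(0)
T, D, K = 16, 8, 3
tokens = rng.standard_normal((T, D))
out = gated_sequence_mix(
    tokens,
    rng.standard_normal((D, D)) * 0.1,
    rng.standard_normal((D, D)) * 0.1,
    rng.standard_normal((K, D)) * 0.1,
)
print(out.shape)
```

The cost of this kind of mixing grows linearly with the number of tokens, which is the usual motivation for pairing a Mamba-style module with quadratic-cost self-attention.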

In practice

Use PCA for HSI dimensionality reduction.
Combine local CNN features with global Transformer context.
Apply Mamba-style modules for efficient sequence refinement.
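The first item above, PCA for dimensionality reduction, can be sketched with NumPy alone. The function name `pca_reduce`, the synthetic cube, and the choice of 15 components are assumptions for illustration; the summary does not state how many components the paper keeps.

```python
import numpy as np

def pca_reduce(cube, n_components):
    """Project an HSI cube (H, W, B) onto its top principal components.

    Returns the reduced cube (H, W, n_components) and the fraction of
    spectral variance retained.
    """
    h, w, bands = cube.shape
    if not 1 <= n_components <= bands:
        raise ValueError("n_components must be between 1 and the band count")
    # Treat every pixel as one sample with `bands` spectral features.
    pixels = cube.reshape(-1, bands).astype(np.float64)
    centered = pixels - pixels.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T
    explained = float((s[:n_components] ** 2).sum() / (s ** 2).sum())
    return reduced.reshape(h, w, n_components), explained

# Synthetic stand-in for a hyperspectral cube: 10 x 12 pixels, 50 bands.
rng = np.random.default_rng(0)
cube = rng.random((10, 12, 50))
reduced, explained = pca_reduce(cube, n_components=15)
print(reduced.shape, round(explained, 3))
```

On real HSI data, neighbouring bands are strongly correlated, so a small number of components usually retains most of the variance. The random cube here has no such structure, so its retained fraction is not representative.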

Topics

Hyperspectral Image Classification
ConvVitMamba
Multiscale CNN
Vision Transformers
Mamba Sequence Modelling

Code references

mqalkhatib/ConvVitMamba

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.