LESSViT: Robust Hyperspectral Representation Learning under Spectral Configuration Shift
Summary
LESSViT (Low-rank Efficient Spatial-Spectral ViT) is a sensor-flexible architecture designed to improve the robustness and generalization of hyperspectral imagery (HSI) models across sensors. Standard Vision Transformer (ViT) approaches struggle with varying wavelength coverage, band sampling, and channel dimensionality, and often fail to generalize when the spectral configuration shifts. LESSViT addresses this with LESS Attention, a structured low-rank factorization that models joint spatial-spectral interactions through separable spatial and spectral components. This reduces the computational complexity of full spatial-spectral attention from O(N^2 C^2) to O(rNC), where N is the number of spatial tokens, C the number of spectral channels, and r the rank of the factorization. The architecture also uses channel-agnostic patch embedding and wavelength-aware positional encoding to accept flexible spectral inputs. For efficient pretraining, LESSViT incorporates HyperMAE, a hyperspectral masked autoencoder with decoupled spatial-spectral masking and hierarchical channel sampling. Experiments on the SpectralEarth benchmark show improved robustness under spectral shifts with competitive in-distribution performance.
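To give a feel for the gap between the two complexity terms quoted in the summary, the short sketch below plugs in illustrative sizes. The values of `n_spatial`, `n_channels`, and `rank` are hypothetical and not taken from the paper, and big-O notation hides constant factors, so the numbers indicate scaling only, not measured cost.

```python
def full_attention_cost(n_spatial: int, n_channels: int) -> int:
    """Pairwise interactions among all N*C spatial-spectral tokens: (N*C)^2."""
    return (n_spatial * n_channels) ** 2


def low_rank_cost(n_spatial: int, n_channels: int, rank: int) -> int:
    """The O(r*N*C) term quoted in the summary, with constants dropped."""
    return rank * n_spatial * n_channels


# Hypothetical sizes: a 14x14 patch grid, 200 spectral bands, rank 4.
n_spatial, n_channels, rank = 196, 200, 4

full = full_attention_cost(n_spatial, n_channels)
low = low_rank_cost(n_spatial, n_channels, rank)

print(f"full attention term: {full:,}")
print(f"low-rank term:       {low:,}")
print(f"ratio:               {full // low:,}x")
```

The ratio simplifies to N*C / r, so the advantage grows with both the spatial grid and the number of bands.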
Key takeaway
For research scientists developing hyperspectral imagery models, LESSViT targets a common failure mode: models trained on one sensor degrade when wavelength coverage, band sampling, or channel count changes. Consider integrating its LESS Attention mechanism and HyperMAE pretraining approach to build models that generalize more effectively across sensors. The architecture directly addresses the trade-off between efficiency and expressiveness, enabling more scalable and adaptable HSI representation learning in real-world applications.
Key insights
LESSViT enhances hyperspectral image model generalization across sensors via efficient low-rank spatial-spectral attention.
Principles
- Explicit spatial-spectral modeling is crucial.
- Low-rank factorization improves efficiency.
- Channel-agnostic embeddings support flexible inputs.
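The idea of channel-agnostic embeddings can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: it assumes one patch projection shared by every band, so the same weights accept any channel count, and it adds a sinusoidal encoding of each band's physical wavelength. The function names, the sinusoidal form, the base of 10000, and the two example sensors are all assumptions made here for illustration.

```python
import numpy as np


def channel_agnostic_patch_embed(image, proj, patch):
    """Embed each band independently with one shared projection.

    image: (C, H, W); proj: (patch*patch, D).
    Returns tokens of shape (N, C, D), N = (H // patch) * (W // patch).
    """
    c, h, w = image.shape
    gh, gw = h // patch, w // patch
    # Split every band into non-overlapping patches: (C, gh, patch, gw, patch).
    x = image[:, : gh * patch, : gw * patch].reshape(c, gh, patch, gw, patch)
    # Reorder to (gh, gw, C, patch, patch), then flatten to (N, C, patch*patch).
    x = x.transpose(1, 3, 0, 2, 4).reshape(gh * gw, c, patch * patch)
    return x @ proj


def wavelength_encoding(wavelengths_nm, dim):
    """Sinusoidal encoding keyed on wavelength (assumed form; dim must be even)."""
    wl = np.asarray(wavelengths_nm, dtype=np.float64)[:, None]
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    enc = np.zeros((wl.shape[0], dim))
    enc[:, 0::2] = np.sin(wl * freqs)
    enc[:, 1::2] = np.cos(wl * freqs)
    return enc


rng = np.random.default_rng(0)
patch, dim = 4, 8
proj = rng.standard_normal((patch * patch, dim))  # stand-in for learned weights

# Two hypothetical sensors with different band counts and band centers (nm).
wl_a = [450.0, 550.0, 650.0]
wl_b = [500.0, 550.0, 600.0, 650.0, 700.0]
img_a = rng.standard_normal((len(wl_a), 8, 8))
img_b = rng.standard_normal((len(wl_b), 8, 8))

tokens_a = channel_agnostic_patch_embed(img_a, proj, patch) + wavelength_encoding(wl_a, dim)[None]
tokens_b = channel_agnostic_patch_embed(img_b, proj, patch) + wavelength_encoding(wl_b, dim)[None]
print(tokens_a.shape, tokens_b.shape)
```

Because the encoding depends on wavelength rather than on channel index, the 550 nm band receives the same positional signal on both sensors even though it sits at a different index in each.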
Method
LESSViT uses LESS Attention for joint spatial-spectral modeling, channel-agnostic patch embedding, and wavelength-aware positional encoding. It employs HyperMAE with decoupled spatial-spectral masking and hierarchical channel sampling for pretraining.
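One way to picture "separable spatial and spectral components" is as a sum of Kronecker products of a small spatial attention matrix and a small spectral attention matrix, applied without ever building the dense (N*C) x (N*C) matrix. The NumPy sketch below illustrates that general idea only. It is not the paper's LESS Attention: the mean pooling used to form spatial and spectral tokens, the random projections standing in for learned weights, and the averaging over `rank` terms are all assumptions, and the sketch does not attempt to reproduce the stated O(rNC) cost.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_weights(tokens, wq, wk):
    """Row-stochastic attention matrix for a set of tokens of shape (T, D)."""
    q, k = tokens @ wq, tokens @ wk
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))


def apply_separable(a_s, a_c, v):
    """Apply (a_s kron a_c) to v of shape (N, C, D) without forming the product."""
    mixed = np.einsum("nm,mkd->nkd", a_s, v)     # mix along the spatial axis
    return np.einsum("ck,nkd->ncd", a_c, mixed)  # mix along the spectral axis


rng = np.random.default_rng(0)
n, c, d, rank = 3, 4, 5, 2  # tiny hypothetical sizes so the dense check is cheap
x = rng.standard_normal((n, c, d))

out = np.zeros_like(x)
dense = np.zeros((n * c, n * c))  # built only to verify the sketch
for _ in range(rank):
    wq_s, wk_s, wq_c, wk_c = rng.standard_normal((4, d, d)) / np.sqrt(d)
    a_s = attention_weights(x.mean(axis=1), wq_s, wk_s)  # (N, N), pooled over bands
    a_c = attention_weights(x.mean(axis=0), wq_c, wk_c)  # (C, C), pooled over space
    out += apply_separable(a_s, a_c, x) / rank
    dense += np.kron(a_s, a_c) / rank

reference = (dense @ x.reshape(n * c, d)).reshape(n, c, d)
print(np.allclose(out, reference))
```

The final check compares the factored computation with the explicit dense matrix on a tiny input, confirming that the two einsum steps are equivalent to multiplying by the sum of Kronecker products.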
In practice
- Apply LESSViT for cross-sensor HSI tasks.
- Utilize HyperMAE for robust HSI pretraining.
- Consider low-rank attention for efficiency.
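For readers planning masked pretraining, the sketch below shows one plausible reading of decoupled spatial-spectral masking: spatial patches and spectral channels are sampled independently, and a token is hidden if either its patch or its channel was selected. This interpretation, the function name `decoupled_mask`, and the ratios used are assumptions for illustration; the summary does not specify HyperMAE's exact masking rule, and hierarchical channel sampling is not covered here.

```python
import numpy as np


def decoupled_mask(n_patches, n_channels, spatial_ratio, spectral_ratio, rng):
    """Return a boolean (N, C) array where True marks a masked token."""
    n_mask_s = int(round(n_patches * spatial_ratio))
    n_mask_c = int(round(n_channels * spectral_ratio))

    # Sample the two axes independently of each other.
    spatial = np.zeros(n_patches, dtype=bool)
    spatial[rng.choice(n_patches, n_mask_s, replace=False)] = True
    spectral = np.zeros(n_channels, dtype=bool)
    spectral[rng.choice(n_channels, n_mask_c, replace=False)] = True

    # A token is masked if its patch OR its channel was selected.
    return spatial[:, None] | spectral[None, :]


rng = np.random.default_rng(0)
mask = decoupled_mask(n_patches=16, n_channels=10,
                      spatial_ratio=0.75, spectral_ratio=0.5, rng=rng)
visible = int((~mask).sum())
print(mask.shape, visible)
```

Under this rule the visible tokens form a grid of unmasked patches by unmasked channels, so the encoder sees (N - masked patches) x (C - masked channels) tokens.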
Topics
- Hyperspectral Imagery
- Vision Transformers
- Spectral Configuration Shift
- LESS Attention
- Cross-spectral Generalization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.