HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet
Summary
HAMSA is a Vision State Space Model (SSM) that operates directly in the spectral domain, eliminating the complex scanning strategies used by existing SSMs such as Vim, VMamba, and SiMBA. It does so through three innovations: a simplified kernel parameterization using a single Gaussian-initialized complex kernel, SpectralPulseNet (SPN) for input-dependent frequency gating, and a Spectral Adaptive Gating Unit (SAGU) for stable gradient flow. Because it relies on FFT-based convolution, HAMSA maintains O(L log L) complexity, where L is the sequence length. On ImageNet-1K, HAMSA achieves 85.7% top-1 accuracy, outperforming other SSMs, and runs inference 2.2x faster than the DeiT-S transformer (4.2 ms vs 9.2 ms). It is also 1.4-1.9x faster than scanning-based SSMs, uses less memory (2.1 GB vs 3.2-4.5 GB), and consumes less energy (12.5 J vs 18-25 J), while generalizing well across a range of tasks.
Key takeaway
For AI engineers and research scientists developing vision models, HAMSA presents a compelling alternative to traditional scanning-based SSMs and transformers. Its spectral domain processing and simplified architecture offer significant improvements in inference speed, memory footprint, and energy efficiency. You should consider evaluating HAMSA for computer vision tasks requiring high performance and resource optimization, especially for deployment on constrained hardware.
Key insights
HAMSA introduces a scanning-free Vision SSM operating in the spectral domain for improved efficiency and performance.
Principles
- Spectral domain processing can simplify SSMs.
- Input-dependent frequency gating enhances adaptability.
Method
HAMSA combines a single Gaussian-initialized complex kernel, SpectralPulseNet (SPN) for input-dependent frequency gating, and the Spectral Adaptive Gating Unit (SAGU) for stable gradient flow, all built on FFT-based convolution with O(L log L) complexity.
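To make the idea of FFT-based convolution with input-dependent frequency gating concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names (`init_complex_kernel`, `magnitude_gate`, `spectral_mix`), the sigmoid-of-log-magnitude gate, the initialization scale, and the single-channel 1-D setting are all assumptions chosen for illustration. It only shows how a kernel and a gate can be applied by pointwise multiplication in the frequency domain, which costs O(L log L) for a sequence of length L.

```python
import numpy as np


def init_complex_kernel(num_freqs, rng, scale=0.02):
    """Hypothetical initializer: complex kernel with Gaussian real and imaginary parts."""
    return scale * (rng.standard_normal(num_freqs) + 1j * rng.standard_normal(num_freqs))


def magnitude_gate(spectrum):
    """Hypothetical input-dependent gate: sigmoid of the centered log-magnitude.

    Returns one value in (0, 1) per frequency bin, so the gate changes with the input.
    """
    log_mag = np.log1p(np.abs(spectrum))
    return 1.0 / (1.0 + np.exp(-(log_mag - log_mag.mean())))


def spectral_mix(x, kernel_f, gate_fn=magnitude_gate):
    """Mix a real sequence of length L globally, without any scan over positions.

    x        : real array of shape (L,)
    kernel_f : complex array of shape (L // 2 + 1,), the kernel in the rFFT domain
    gate_fn  : maps the input spectrum to per-frequency gates
    """
    L = x.shape[0]
    spectrum = np.fft.rfft(x)                        # O(L log L)
    gated = spectrum * kernel_f * gate_fn(spectrum)  # pointwise, O(L)
    return np.fft.irfft(gated, n=L)                  # O(L log L)


rng = np.random.default_rng(0)
L = 196  # assumed: a 14 x 14 grid of patch tokens, flattened
x = rng.standard_normal(L)
kernel_f = init_complex_kernel(L // 2 + 1, rng)
y = spectral_mix(x, kernel_f)  # same length as x
```

With the gate fixed to 1, `spectral_mix` reduces to an ordinary circular convolution of `x` with the time-domain kernel `np.fft.irfft(kernel_f, n=L)`, which is what makes a spectral formulation a drop-in replacement for an explicit long convolution. A real model would apply this per channel, over 2-D token grids, and with learned parameters.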
In practice
- Achieves 85.7% top-1 accuracy on ImageNet-1K.
- 2.2x faster inference than the DeiT-S transformer (4.2 ms vs 9.2 ms).
- Uses 2.1 GB of memory (vs 3.2-4.5 GB) and 12.5 J of energy (vs 18-25 J).
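The headline ratios can be checked directly from the figures reported in the summary. The short sketch below does that arithmetic; the helper names `speedup` and `reduction_pct` are illustrative, and the baseline ranges are taken as quoted without assuming which specific models they correspond to.

```python
def speedup(baseline, new):
    """How many times faster `new` is than `baseline` (both are latencies)."""
    return baseline / new


def reduction_pct(baseline, new):
    """Percentage saved when going from `baseline` to `new`."""
    return 100.0 * (baseline - new) / baseline


# Latency in ms, as reported: DeiT-S 9.2 vs HAMSA 4.2
latency_speedup = speedup(9.2, 4.2)  # about 2.19, reported as 2.2x

# Memory in GB, as reported: 3.2-4.5 for the baselines vs 2.1 for HAMSA
memory_saving = (reduction_pct(3.2, 2.1), reduction_pct(4.5, 2.1))  # roughly 34% to 53%

# Energy in J, as reported: 18-25 for the baselines vs 12.5 for HAMSA
energy_saving = (reduction_pct(18.0, 12.5), reduction_pct(25.0, 12.5))  # roughly 31% to 50%
```

Read this way, the memory and energy savings fall between roughly one third and one half, depending on which end of the baseline range is used.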
Topics
- Vision State Space Models
- SpectralPulseNet
- Spectral Adaptive Gating Unit
- FFT-based Convolution
- ImageNet-1K
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.