HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet

2026-04-16 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

HAMSA is a Vision State Space Model (SSM) that operates directly in the spectral domain, eliminating the scanning strategies that existing vision SSMs such as Vim, VMamba, and SiMBA depend on. It does this through three components: a simplified kernel parameterization using a single Gaussian-initialized complex kernel, SpectralPulseNet (SPN) for input-dependent frequency gating, and a Spectral Adaptive Gating Unit (SAGU) for stable gradient flow. Because it uses FFT-based convolution, HAMSA keeps O(L log L) complexity, where L is the sequence length. On ImageNet-1K, HAMSA achieves 85.7% top-1 accuracy, outperforming other SSMs, and runs inference about 2.2× faster than the DeiT-S transformer (4.2 ms vs 9.2 ms). It is also 1.4-1.9× faster than scanning-based SSMs, uses less memory (2.1 GB vs 3.2-4.5 GB), and consumes less energy (12.5 J vs 18-25 J), while generalizing well across a range of tasks.

Key takeaway

For AI engineers and research scientists developing vision models, HAMSA is an alternative to both scanning-based SSMs and transformers. Its spectral-domain processing and simplified architecture improve inference speed, memory footprint, and energy use. Consider evaluating HAMSA for computer vision tasks where accuracy and resource efficiency both matter, especially for deployment on constrained hardware.

Key insights

HAMSA is a scanning-free Vision SSM that operates in the spectral domain, improving both efficiency and accuracy.

Principles

Spectral domain processing can simplify SSMs.
Input-dependent frequency gating enhances adaptability.
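The first principle rests on the convolution theorem: a long (circular) convolution over a sequence becomes an elementwise product after an FFT. The sketch below is a generic NumPy illustration of that identity, not HAMSA's code; the function names and the sequence length of 64 are assumptions chosen for the example.

```python
import numpy as np

def circular_conv_direct(x, k):
    """Reference circular convolution written out directly: O(L^2)."""
    L = len(x)
    y = np.zeros(L, dtype=np.result_type(x, k))
    for n in range(L):
        for m in range(L):
            y[n] += x[m] * k[(n - m) % L]
    return y

def circular_conv_fft(x, k):
    """Same operation via the convolution theorem: O(L log L)."""
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(k))

rng = np.random.default_rng(0)
L = 64  # illustrative sequence length
x = rng.standard_normal(L)
k = rng.standard_normal(L)

y_direct = circular_conv_direct(x, k)
y_fft = circular_conv_fft(x, k).real  # imaginary part is numerical noise here
max_err = float(np.max(np.abs(y_direct - y_fft)))
```

The two results agree to floating-point precision, which is why a model can apply its kernel in the frequency domain without an explicit token-by-token scan.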

Method

HAMSA combines a single Gaussian-initialized complex kernel, SpectralPulseNet (SPN) for input-dependent frequency gating, and a Spectral Adaptive Gating Unit (SAGU) for stable gradients, all built on FFT-based convolution.
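As a rough illustration of how these pieces could fit together, here is a minimal NumPy sketch of one spectral mixing step. Everything specific in it is an assumption: the summary does not give the form of SPN or SAGU, so the gate below is a hypothetical sigmoid over frequency magnitudes standing in for SPN, SAGU is omitted, and the names `gaussian_complex_kernel`, `spectral_mixing`, `w_gate`, and `b_gate`, the initialization scale, and the tensor sizes are all made up for the example.

```python
import numpy as np

def gaussian_complex_kernel(n_freq, rng, std=0.02):
    """Assumed init: real and imaginary parts drawn from N(0, std^2)."""
    return std * (rng.standard_normal(n_freq) + 1j * rng.standard_normal(n_freq))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spectral_mixing(x, kernel, w_gate, b_gate):
    """Map a real (L, D) token sequence to a real (L, D) output.

    1. Real FFT along the sequence axis.
    2. Input-dependent gate per frequency bin (hypothetical SPN stand-in).
    3. Multiply by one shared complex kernel, then inverse real FFT.
    """
    L = x.shape[0]
    X = np.fft.rfft(x, axis=0)                    # (L // 2 + 1, D), complex
    gate = sigmoid(np.abs(X) @ w_gate + b_gate)   # (L // 2 + 1, D), values in (0, 1)
    Y = X * gate * kernel[:, None]                # kernel broadcast over channels
    return np.fft.irfft(Y, n=L, axis=0)           # back to (L, D), real

rng = np.random.default_rng(0)
L, D = 196, 8                                     # illustrative sizes only
x = rng.standard_normal((L, D))
kernel = gaussian_complex_kernel(L // 2 + 1, rng)
w_gate = 0.1 * rng.standard_normal((D, D))
b_gate = np.zeros(D)
y = spectral_mixing(x, kernel, w_gate, b_gate)
```

The point of the sketch is the data flow: because the gate is computed from the input's own spectrum, the effective filter changes per input, while the cost stays dominated by the two FFTs.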

In practice

Achieves 85.7% top-1 accuracy on ImageNet-1K.
Runs inference 2.2× faster than the DeiT-S transformer (4.2 ms vs 9.2 ms).
Reduces memory to 2.1 GB (vs 3.2-4.5 GB) and energy to 12.5 J (vs 18-25 J).
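The headline ratios can be checked against the raw figures quoted in the summary. The snippet below only redoes that arithmetic; the variable names are illustrative, and the memory and energy ranges are taken as reported, without assuming which specific baseline sits at each end.

```python
# Latency figures from the summary (milliseconds).
deit_s_ms, hamsa_ms = 9.2, 4.2
speedup = deit_s_ms / hamsa_ms                    # about 2.19, reported as 2.2x

# Relative savings against the low and high end of each quoted range.
mem_saving = [1 - 2.1 / m for m in (3.2, 4.5)]    # about 34% and 53%
energy_saving = [1 - 12.5 / e for e in (18.0, 25.0)]  # about 31% and 50%
```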

Topics

Vision State Space Models
SpectralPulseNet
Spectral Adaptive Gating Unit
FFT-based Convolution
ImageNet-1K

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.