Why SSMs struggle in parameter-constrained training: empirical findings at 25M parameters [R]
Summary
Empirical findings from OpenAI's Parameter Golf competition indicate that State Space Models (SSMs) are structurally disadvantaged relative to Transformers under strict parameter, time, and size constraints: 25M parameters, 10 minutes of training on 8xH100s, and a 16MB artifact limit. A key finding is that SSM in_proj weights compress up to 3.26x worse than Transformer attention QKV weights under LZMA, which directly reduces the number of parameters that fit in the compressed budget. In addition, architectural configurations that showed promise with the SP4096 vocabulary reversed at SP8192, indicating a "canonical SSM failure mode" in which a fixed state size struggles with increased sequence length. Kernel-level experiments on Mamba-3 Triton kernels also found a 16% slowdown from an attempted backward fusion (attributed to SMEM pressure), a 5.5 mBPB loss from a torch.compile quantizer bug, and a 0.8 mBPB recovery from mixed-precision dynamics protection.
Key takeaway
If you are optimizing models under strict parameter and artifact-size constraints, expect SSMs to face inherent challenges with weight compressibility and state scaling compared to Transformers. The final compressed size of a model depends heavily on its weight distribution, so consider training with regularizers that encourage compressible weight structures. Be aware that a change in vocabulary size can flip which configuration performs best, especially with SSMs, and apply mixed-precision dynamics protection to selective-scan operations.
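To make the artifact constraint concrete, the arithmetic can be sketched as below. The parameter count and the 16MB limit come from the summary above; the bit widths, the decimal reading of "MB", and the function name are assumptions made only for illustration, not details from the original article.

```python
def required_compression_ratio(n_params: int, bits_per_weight: int, budget_bytes: int) -> float:
    """Return raw_size / budget: the ratio a compressor must reach for the weights to fit.

    A value at or below 1.0 means the raw weights already fit without compression.
    """
    raw_bytes = n_params * bits_per_weight / 8
    return raw_bytes / budget_bytes


N_PARAMS = 25_000_000      # from the summary
BUDGET_BYTES = 16_000_000  # assumption: "16MB" read as 16 * 10^6 bytes

# Assumption: these bit widths are illustrative, the article's quantization
# scheme is not stated in this summary.
for bits in (16, 8, 6, 4):
    ratio = required_compression_ratio(N_PARAMS, bits, BUDGET_BYTES)
    raw_mb = N_PARAMS * bits / 8 / 1e6
    print(f"{bits:>2}-bit weights: raw {raw_mb:.2f} MB, need >= {ratio:.2f}x compression")
```

Under these assumptions, anything stored above roughly 5 bits per weight has to be made up by the compressor, which is why a weight matrix that compresses several times worse than another translates directly into fewer usable parameters.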
Key insights
SSMs struggle in parameter-constrained settings due to poor weight compressibility and state size limitations.
Principles
- The LZMA compression ratio depends on the distribution of the weights.
- A fixed SSM state size becomes a bottleneck as sequence length grows.
- Mixed-precision training needs protection for the SSM dynamics.
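The first principle can be illustrated with a small experiment. The sketch below is not the article's measurement: it uses synthetic weights, a hypothetical symmetric int8 quantizer, and arbitrary distribution parameters, and only shows that two tensors of the same size can have very different LZMA ratios.

```python
import lzma
import random


def quantize_int8(weights, scale):
    """Symmetric int8 quantization: round(w / scale), clamped to [-127, 127].

    Negative values are stored as two's-complement bytes.
    """
    return bytes(max(-127, min(127, round(w / scale))) & 0xFF for w in weights)


def lzma_ratio(raw: bytes) -> float:
    """Uncompressed size divided by LZMA-compressed size (higher is better)."""
    return len(raw) / len(lzma.compress(raw, preset=9))


rng = random.Random(0)  # fixed seed so the run is repeatable
N = 100_000
SCALE = 0.005           # hypothetical quantization step shared by both tensors

# Peaked tensor: most values land within a few quantization steps of zero.
peaked = [rng.gauss(0.0, 0.02) for _ in range(N)]
# Broad tensor: values spread evenly over the whole int8 range.
broad = [rng.uniform(-127 * SCALE, 127 * SCALE) for _ in range(N)]

ratio_peaked = lzma_ratio(quantize_int8(peaked, SCALE))
ratio_broad = lzma_ratio(quantize_int8(broad, SCALE))
print(f"peaked: {ratio_peaked:.2f}x   broad: {ratio_broad:.2f}x")
```

The same helper could be pointed at real quantized tensors (for example an SSM in_proj matrix and an attention QKV matrix) to compare their ratios, provided both are serialized the same way.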
Method
The study was carried out within a parameter-constrained competition. It analyzed weight compressibility under LZMA, evaluated architectural configurations across vocabulary sizes, and ran kernel-level experiments on Mamba-3 Triton kernels.
In practice
- Measure how well model weights compress under LZMA when the artifact size is capped.
- Regularize weights toward distributions that compress well.
- Apply fp32 patches to selective-scan operations when training in mixed precision.
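As a rough illustration of why the last item matters, the sketch below runs a scalar linear recurrence (a stand-in for the core of a selective scan) with its decay factor either kept in full precision or rounded to bfloat16. The summary does not describe the actual patch, so everything here is an assumption: the `to_bf16` emulation, the parameter values, and the function names are hypothetical, and Python floats stand in for fp32.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even), returned as a float.

    Emulated by rounding the float32 bit pattern to its top 16 bits.
    NaN and overflow are not handled; this is only for the small values used here.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def scan(decay: float, inputs) -> float:
    """Scalar recurrence h_t = decay * h_{t-1} + x_t; returns the final state."""
    h = 0.0
    for x in inputs:
        h = decay * h + x
    return h


# Hypothetical dynamics parameters: a slow decay, i.e. a long memory.
dt, a_log = 0.001, 0.0
decay_full = math.exp(-dt * math.exp(a_log))  # about 0.999
decay_bf16 = to_bf16(decay_full)              # bf16 spacing below 1.0 is 2**-8

inputs = [1.0] * 5000
h_full = scan(decay_full, inputs)  # saturates near 1 / (1 - decay_full)
h_bf16 = scan(decay_bf16, inputs)  # decay rounded to 1.0, state never forgets

print(f"decay full={decay_full:.7f}  bf16={decay_bf16}")
print(f"final state full={h_full:.1f}  bf16={h_bf16:.1f}")
```

In this toy setting the rounded decay becomes exactly 1.0, so the state grows without bound instead of saturating. Keeping the parameters that define the dynamics in fp32, while the rest of the model runs in lower precision, avoids that kind of distortion.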
Topics
- State Space Models
- Transformer Architecture
- Parameter-Constrained Training
- LZMA Compression
- Mamba-3 Triton Kernels
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.