Why SSMs struggle in parameter-constrained training: empirical findings at 25M parameters [R]

2026-05-04 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, short

Summary

Empirical findings from OpenAI's Parameter Golf competition suggest that State Space Models (SSMs) are structurally disadvantaged relative to Transformers under strict parameter, time, and size constraints: 25M parameters, 10 minutes of training on 8xH100, and a 16MB artifact limit. A key finding is that SSM in_proj weights compress up to 3.26x worse than Transformer attention QKV weights under LZMA, which cuts directly into the compressed parameter budget. In addition, architectural configurations that looked promising at the SP4096 vocabulary size reversed at SP8192, indicating a "canonical SSM failure mode" in which a fixed state size struggles with increased sequence length. Kernel-level experiments on Mamba-3 Triton kernels produced three further results: a backward fusion attempt caused a 16% slowdown due to SMEM pressure, a torch.compile quantizer bug cost 5.5 mBPB (milli-bits per byte), and mixed-precision dynamics protection recovered 0.8 mBPB.

Key takeaway

AI engineers optimizing models under strict parameter and artifact-size constraints should expect SSMs to face inherent challenges with weight compressibility and state scaling compared to Transformers. The final compressed size depends heavily on the distribution of the weights, so consider training with regularizers that encourage compressible weight structures. Watch for performance rankings that flip when the vocabulary size changes, especially with SSMs, and implement mixed-precision dynamics protection for selective-scan operations.

Key insights

SSMs struggle in parameter-constrained settings due to poor weight compressibility and state size limitations.

Principles

LZMA compression ratio depends on the weight distribution.
SSM state size must scale with sequence length.
Mixed-precision training requires dynamics protection.

Method

The findings come from experiments run in a parameter-constrained competition: measuring weight compressibility under LZMA, comparing architectural configurations across vocabulary sizes, and running kernel-level experiments on Mamba-3 Triton kernels.
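The compressibility measurement mentioned above can be approximated with Python's standard `lzma` module. The following is a minimal sketch, not the competition's actual pipeline: the int8 quantization, the synthetic Gaussian "weights", and the helper names `lzma_ratio` and `quantized_gaussian` are all assumptions made for illustration. It only shows the general effect that a tensor whose quantized values spread over a wider range tends to compress worse.

```python
import lzma
import random


def lzma_ratio(raw: bytes) -> float:
    """Return uncompressed size divided by LZMA-compressed size."""
    return len(raw) / len(lzma.compress(raw, preset=9))


def quantized_gaussian(n: int, std: float, seed: int) -> bytes:
    """Synthetic int8 'weights' (hypothetical stand-in for a real tensor).

    Gaussian samples are rounded to integers, clamped to [-127, 127],
    and stored as two's-complement bytes.
    """
    rng = random.Random(seed)
    vals = (max(-127, min(127, round(rng.gauss(0.0, std)))) for _ in range(n))
    return bytes(v & 0xFF for v in vals)


# Narrow distribution: few distinct byte values, low entropy.
narrow = quantized_gaussian(100_000, std=2.0, seed=0)
# Wide distribution: values spread over most of the int8 range.
wide = quantized_gaussian(100_000, std=40.0, seed=0)

ratio_narrow = lzma_ratio(narrow)
ratio_wide = lzma_ratio(wide)

print(f"narrow: {ratio_narrow:.2f}x  wide: {ratio_wide:.2f}x")
print(f"relative gap: {ratio_narrow / ratio_wide:.2f}x")
```

To compare real layers, the same `lzma_ratio` call could be applied to the serialized bytes of each weight tensor in the form that is actually shipped in the artifact.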

In practice

Consider LZMA compressibility for model weights.
Regularize weights for compressible distributions.
Implement fp32 patches for selective-scan in mixed precision.
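One way to see why the scan dynamics may need higher precision is to look at the decay factor of a discretized diagonal SSM. The sketch below is an illustration under stated assumptions, not the Mamba-3 kernel code or the patch used in the original experiments: `to_bf16` is a hypothetical helper that emulates bfloat16 rounding for finite values, Python floats stand in for the protected higher-precision path, and the recurrence is reduced to a single scalar channel with step size `dt` and state coefficient `A` chosen for the example.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even).

    Hypothetical helper: packs to float32, then keeps the top 16 bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def impulse_response(decay: float, steps: int, round_state) -> float:
    """Run h_t = decay * h_{t-1} from h_0 = 1 and return the final state."""
    h = 1.0
    for _ in range(steps):
        h = round_state(decay * h)
    return h


dt, A = 1e-3, -1.0                 # assumed values for the example
decay_full = math.exp(dt * A)      # about 0.999, a slowly decaying state
decay_bf16 = to_bf16(decay_full)   # rounds up to exactly 1.0 in bfloat16

# Protected path: decay and state kept in higher precision.
h_protected = impulse_response(decay_full, 2000, lambda v: v)
# Unprotected path: decay and state both rounded to bfloat16.
h_unprotected = impulse_response(decay_bf16, 2000, to_bf16)

print(f"decay full={decay_full:.6f}  bf16={decay_bf16:.6f}")
print(f"state after 2000 steps: protected={h_protected:.4f}  "
      f"unprotected={h_unprotected:.4f}")
```

In this toy setting the bfloat16 decay factor becomes exactly 1.0, so the state never forgets, while the higher-precision path decays to roughly exp(-2). Keeping the discretization parameters and the recurrent state in fp32 while leaving the large projections in low precision is one plausible reading of "dynamics protection".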

Topics

State Space Models
Transformer Architecture
Parameter-Constrained Training
LZMA Compression
Mamba-3 Triton Kernels

Code references

swfsql/burn-mamba

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.