Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models
Summary
Ternary Mamba introduces a grouped quantization-aware training (QAT) method for compressing State Space Models (SSMs) such as Mamba-2, whose memory footprint is an obstacle to edge deployment. Prior ternary SSM approaches train from scratch on 150B tokens. Ternary Mamba instead starts from a pretrained checkpoint and distills knowledge from a frozen FP16 teacher, cutting the token budget by more than 1,000x. The technique compresses Mamba-2 1.3B by 3.61x, from 2,687 MB to 744 MB, and reaches 48.1% zero-shot accuracy (7-task average) using only 102M tokens and 4 GPU-hours on a single H100, close to Bi-Mamba's 48.4%. The research also identifies "zero-ratio collapse," a novel instability in QAT-from-pretrained settings, and notes that Transformer-specific post-hoc corrections are ineffective for SSMs due to error accumulation.
Key takeaway
For Machine Learning Engineers deploying State Space Models on memory-constrained edge devices, Ternary Mamba shows that a pretrained SSM can be made ternary without retraining from scratch. Consider applying grouped quantization-aware training with knowledge distillation to a pretrained checkpoint: in the reported results this shrinks Mamba-2 1.3B from 2,687 MB to 744 MB using 102M training tokens instead of 150B. Watch for "zero-ratio collapse" during training, and do not expect Transformer-specific post-hoc corrections to carry over to SSMs.
Key insights
Ternary Mamba enables efficient compression of pretrained State Space Models using grouped QAT and knowledge distillation, avoiding expensive from-scratch training.
Principles
- Pretrained checkpoints enable efficient ternary SSMs.
- QAT with KD offers data-efficient SSM compression.
- SSMs face unique quantization stability issues.
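
To make the knowledge-distillation principle concrete, here is a minimal sketch of a distillation loss in plain Python. The summary does not state which objective the authors use, so the temperature-scaled KL divergence between teacher and student token distributions below is an assumption, and the names `softmax` and `kd_loss` are illustrative only.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token position (assumed objective).

    The teacher stands in for the frozen FP16 model and the student for
    the ternary model being trained; only the student would get gradients.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [2.0, 0.5, -1.0]
print(kd_loss(teacher, teacher))            # identical distributions
print(kd_loss(teacher, [0.1, 0.2, 0.3]))    # mismatched student
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what lets a small token budget transfer behavior from the pretrained model.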
Method
Grouped quantization-aware training (QAT) is applied to a pretrained Mamba-2 1.3B checkpoint, using knowledge distillation from a frozen FP16 teacher to achieve W1.58A16 compression, that is, ternary weights (about 1.58 bits each) with 16-bit activations.
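
The following sketch illustrates what grouped ternary quantization could look like. It is not the authors' implementation: the absmean scaling rule (as popularized by BitNet b1.58), the group size of 4, and the function names are all assumptions made for illustration. It also computes a "zero ratio", assumed here to mean the fraction of weights that quantize to 0, which is the quantity the term "zero-ratio collapse" appears to refer to.

```python
def ternarize_group(weights, eps=1e-8):
    """Map one group to codes in {-1, 0, +1} plus a shared scale.

    Assumed rule: scale = mean(|w|), code = clip(round(w / scale), -1, 1).
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def ternarize_grouped(weights, group_size):
    """Quantize a flat weight list group by group, one scale per group."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group_codes, scale = ternarize_group(weights[start:start + group_size])
        codes.extend(group_codes)
        scales.append(scale)
    return codes, scales

def dequantize(codes, scales, group_size):
    """Reconstruct approximate weights used in the forward pass."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

def zero_ratio(codes):
    """Fraction of ternary codes equal to zero (assumed definition)."""
    return codes.count(0) / len(codes)

# Hypothetical weights: the second group has one outlier among small values.
w = [0.9, -0.05, 0.4, -1.2, 0.02, 0.01, -0.03, 0.5]
codes, scales = ternarize_grouped(w, group_size=4)
print(codes, scales, zero_ratio(codes))
```

Per-group scales let a group of small weights keep its own resolution instead of being zeroed by a scale dominated by large weights elsewhere in the tensor. In an actual QAT loop the rounding step is non-differentiable, so gradients are usually passed through it with a straight-through estimator; that part is omitted here.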
In practice
- Compress Mamba-2 1.3B to 744 MB.
- Achieve 48.1% accuracy with 102M tokens.
- Avoid 150B-token from-scratch training.
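
The headline numbers can be checked with a few lines of arithmetic. This snippet only uses figures quoted in the summary; the variable names are illustrative.

```python
import math

fp16_mb = 2687        # FP16 Mamba-2 1.3B size, from the summary
ternary_mb = 744      # compressed size, from the summary
compression = fp16_mb / ternary_mb

scratch_tokens = 150e9   # from-scratch ternary training budget
qat_tokens = 102e6       # Ternary Mamba training budget
token_reduction = scratch_tokens / qat_tokens

# A ternary weight has 3 states, so it carries log2(3) bits of information.
bits_per_weight = math.log2(3)

print(f"compression: {compression:.2f}x")
print(f"token reduction: {token_reduction:.0f}x")
print(f"bits per ternary weight: {bits_per_weight:.3f}")
```

The "1.58" in W1.58A16 is log2(3). If every parameter were stored at that width, the reduction from 16 bits would be roughly 10x, so the reported 3.61x implies some of the model's storage is not at 1.58 bits; the summary does not say what accounts for the gap.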
Topics
- State Space Models
- Mamba-2
- Quantization-Aware Training
- Knowledge Distillation
- Model Compression
- Edge AI Deployment
- Ternary Models
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.