Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Ternary Mamba introduces a grouped quantization-aware training (QAT) method for compressing State Space Models (SSMs) such as Mamba-2, whose memory footprint is a barrier to edge deployment. The target format is W1.58A16: ternary weights in {-1, 0, +1}, which carry log2(3) ≈ 1.58 bits each, paired with 16-bit activations. Unlike prior ternary SSM approaches that train from scratch on 150B tokens, Ternary Mamba starts from a pretrained checkpoint and distills from a frozen FP16 teacher, cutting the token budget by more than 1,000x. The method compresses Mamba-2 1.3B by 3.61x, from 2,687 MB to 744 MB, and reaches 48.1% zero-shot accuracy (7-task average) using only 102M tokens and 4 GPU-hours on a single H100, close to Bi-Mamba's 48.4%. The research also identifies "zero-ratio collapse," a novel instability in QAT-from-pretrained settings, and notes that post-hoc corrections designed for Transformers are ineffective for SSMs because of error accumulation.
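The headline figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this summary; the variable names are illustrative and not taken from the article.

```python
import math

# Figures quoted in the summary (variable names are illustrative).
fp16_mb = 2687           # Mamba-2 1.3B checkpoint size in FP16
ternary_mb = 744         # size after W1.58A16 compression
scratch_tokens = 150e9   # from-scratch budget cited for prior work
qat_tokens = 102e6       # Ternary Mamba token budget

compression = fp16_mb / ternary_mb
token_ratio = scratch_tokens / qat_tokens
bits_per_ternary_weight = math.log2(3)  # three states: -1, 0, +1

print(f"compression: {compression:.2f}x")              # 3.61x
print(f"token ratio: {token_ratio:.0f}x")              # 1471x
print(f"bits/weight: {bits_per_ternary_weight:.3f}")   # 1.585
```

The overall 3.61x ratio is well below the roughly 10x that 16 bits versus 1.58 bits per weight would give on its own, which suggests that part of the model stays in higher precision. The summary does not say which part.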

Key takeaway

For machine learning engineers deploying State Space Models on memory-constrained edge devices, Ternary Mamba offers a practical compression route. Consider applying grouped quantization-aware training with knowledge distillation to a pretrained SSM: in the reported results this cuts Mamba-2 1.3B from 2,687 MB to 744 MB with only 102M training tokens, avoiding expensive from-scratch training. Watch for "zero-ratio collapse" during training, and do not rely on Transformer-specific post-hoc corrections, which are reported to be ineffective for SSMs.

Key insights

Ternary Mamba enables efficient compression of pretrained State Space Models using grouped QAT and knowledge distillation, avoiding expensive from-scratch training.
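To make the distillation half of this concrete, here is a minimal sketch of a teacher-student loss for a single token position. It assumes the standard temperature-scaled KL divergence between teacher and student distributions; the summary does not state the article's exact loss, and the function names and the temperature value are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) for one token, scaled by temperature squared.

    Assumption: this is the common Hinton-style formulation, not
    necessarily the loss used in the article.
    """
    p = softmax(teacher_logits, temperature)  # frozen teacher (FP16 in the article)
    q = softmax(student_logits, temperature)  # quantized student
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [2.0, 0.5, -1.0]
student_close = [1.9, 0.6, -1.1]   # nearly agrees with the teacher
student_far = [-1.0, 0.5, 2.0]     # ranks the tokens in reverse

loss_same = distillation_loss(teacher, teacher)
loss_close = distillation_loss(student_close, teacher)
loss_far = distillation_loss(student_far, teacher)
print(loss_same, loss_close, loss_far)
```

The loss is zero when the student reproduces the teacher exactly and grows as the two distributions diverge. In training, only the student would receive gradients, since the teacher is frozen.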

Principles

Method

Grouped quantization-aware training (QAT) is applied to a pretrained Mamba-2 1.3B checkpoint, using knowledge distillation from a frozen FP16 teacher to achieve W1.58A16 compression.
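The grouped part of the method can be illustrated with a small quantizer. This is a sketch under stated assumptions, not the article's implementation: it uses absmean scaling per group (as in BitNet b1.58-style quantizers), plain Python lists instead of tensors, and omits the straight-through estimator that QAT would need for gradients. The `zero_ratio` helper assumes that "zero-ratio collapse" relates to the fraction of weights quantized to 0, which is an interpretation rather than a definition from the source.

```python
def ternarize_grouped(weights, group_size=4, eps=1e-8):
    """Quantize a flat list of floats to {-1, 0, +1}, one scale per group.

    Assumption: scale = mean absolute value of the group (absmean).
    The article's exact quantization rule may differ.
    """
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = sum(abs(w) for w in group) / len(group) + eps
        scales.append(scale)
        for w in group:
            q = round(w / scale)               # nearest integer
            codes.append(max(-1, min(1, q)))   # clip to the ternary set
    return codes, scales


def dequantize(codes, scales, group_size=4):
    """Rebuild approximate weights: each code times its group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def zero_ratio(codes):
    """Fraction of weights mapped to 0 (a diagnostic worth logging)."""
    return codes.count(0) / len(codes)


# Two groups with very different magnitudes.
weights = [0.9, -0.1, 0.05, -1.1, 0.02, 0.01, -0.03, 0.04]

codes, scales = ternarize_grouped(weights, group_size=4)
approx = dequantize(codes, scales, group_size=4)

# Same weights with a single scale for the whole tensor.
coarse_codes, _ = ternarize_grouped(weights, group_size=8)

print(codes, zero_ratio(codes))
print(coarse_codes, zero_ratio(coarse_codes))
```

With one scale per group of four, the small-magnitude second group keeps non-zero codes and the zero ratio is 0.375. With a single scale across all eight weights, the large values dominate the scale, the second group is quantized entirely to zero, and the zero ratio rises to 0.75. That sensitivity to scale granularity is the motivation for grouping in this sketch.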

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.