TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization
Summary
TWLA is a post-training quantization (PTQ) framework that reduces the memory and compute costs of Large Language Models (LLMs) through aggressive compression. Existing methods struggle with heavy-tailed activation distributions and therefore keep activations at high precision. TWLA instead combines 1.58-bit weight compression (ternary weights, since log2(3) ≈ 1.58) with 4-bit activation quantization while preserving high accuracy. The framework integrates three components: the Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ), which minimizes layer-output error during weight ternarization; Kronecker Orthogonal Tri-Modal Shaping (KOTMS), which reshapes weights into ternary-friendly distributions and suppresses activation outliers; and Inter-Layer Aware Activation Mixed Precision (ILA-AMP), which allocates activation bit-widths while accounting for adjacent-layer interactions. The authors report that TWLA maintains high accuracy under W1.58A4 (1.58-bit weights, 4-bit activations) and delivers substantial inference acceleration.
Key takeaway
If you are a Machine Learning Engineer deploying Large Language Models under stringent memory and compute constraints, TWLA is worth evaluating. This post-training quantization framework enables W1.58A4 compression, reducing model size and accelerating inference while maintaining high accuracy, even with heavy-tailed activation distributions. Consider integrating TWLA to lower serving costs and broaden where your LLM applications can be deployed.
Key insights
TWLA enables 1.58-bit weights and 4-bit activations for LLMs, overcoming heavy-tailed activation challenges for end-to-end acceleration.
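To illustrate why heavy-tailed activations make 4-bit quantization difficult, here is a minimal NumPy sketch. It assumes a plain symmetric per-token quantizer, which is a generic scheme chosen for illustration and not necessarily the one TWLA uses; the function name `fake_quant_per_token` and the outlier value are hypothetical.

```python
import numpy as np

def fake_quant_per_token(x, bits=4):
    """Quantize each row (token) to signed `bits`-bit integers, then dequantize.

    Generic symmetric per-token scheme, assumed for illustration only.
    """
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    scale = np.max(np.abs(x), axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # avoid division by zero
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))                         # well-behaved activations

x_outlier = x.copy()
x_outlier[:, 0] = 100.0                              # one large outlier channel

# Compare the error on the ordinary channels only (columns 1 onward).
err_clean = np.mean((fake_quant_per_token(x)[:, 1:] - x[:, 1:]) ** 2)
err_outlier = np.mean(
    (fake_quant_per_token(x_outlier)[:, 1:] - x_outlier[:, 1:]) ** 2
)

print(f"MSE without outlier: {err_clean:.4f}")
print(f"MSE with outlier:    {err_outlier:.4f}")
```

A single outlier stretches the per-token scale, so the remaining values collapse onto very few of the 16 available levels. This is the kind of effect that outlier suppression is meant to counter.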
Principles
- Minimize layer-output error via two-stage optimization.
- Reshape weights into ternary-friendly distributions.
- Optimize bit allocation considering adjacent-layer interactions.
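The reshaping principle relies on a standard property of orthogonal transforms: rotating activations by Q and weights by Q^T leaves the layer output unchanged, and a Kronecker product of two small orthogonal matrices is itself orthogonal. The sketch below checks both facts with random matrices. How KOTMS actually constructs or optimizes its factors is not described in this summary, so the random factors and all names here are assumptions.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Return an n x n orthogonal matrix via QR of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

rng = np.random.default_rng(0)

# Two small orthogonal factors; their Kronecker product is a 32 x 32
# orthogonal matrix stored with 4*4 + 8*8 parameters instead of 32*32.
q1 = random_orthogonal(4, rng)
q2 = random_orthogonal(8, rng)
q = np.kron(q1, q2)

x = rng.normal(size=(5, 32))      # activations: (tokens, in_features)
x[:, 3] *= 50.0                   # inject an outlier channel
w = rng.normal(size=(32, 16))     # weights: (in_features, out_features)

x_rot = x @ q                     # rotated activations
w_rot = q.T @ w                   # counter-rotated weights

y = x @ w
y_rot = x_rot @ w_rot             # equals y because q @ q.T is the identity

print("max |x|     :", np.abs(x).max())
print("max |x_rot| :", np.abs(x_rot).max())
print("outputs match:", np.allclose(y, y_rot))
```

Because the output is preserved exactly, the rotation can be chosen freely to make both the weights and the activations easier to quantize.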
Method
TWLA employs E2M-ATQ for weight ternarization, KOTMS for weight reshaping and outlier suppression, and ILA-AMP for inter-layer aware activation mixed precision bit allocation.
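For orientation, the sketch below shows what weight ternarization and layer-output error look like in code. It uses a simple threshold-based ternarizer with a per-row scale, which is a generic baseline and not E2M-ATQ; the `threshold_factor` of 0.7 is a common heuristic assumed here, and all names are hypothetical.

```python
import math
import numpy as np

def ternarize(w, threshold_factor=0.7):
    """Map each weight to {-1, 0, +1} with one scale per output row.

    Generic threshold baseline for illustration; NOT the E2M-ATQ algorithm.
    """
    delta = threshold_factor * np.mean(np.abs(w), axis=1, keepdims=True)
    codes = np.where(w > delta, 1, np.where(w < -delta, -1, 0)).astype(np.int8)
    mask = codes != 0
    # Least-squares optimal scale for fixed codes: mean |w| over kept weights.
    kept_sum = np.sum(np.abs(w) * mask, axis=1, keepdims=True)
    kept_count = np.maximum(mask.sum(axis=1, keepdims=True), 1)
    alpha = kept_sum / kept_count
    return codes, alpha

# Three states per weight carry log2(3) bits, hence "1.58-bit".
bits_per_weight = math.log2(3)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16))      # (out_features, in_features)
x = rng.normal(size=(8, 16))      # (tokens, in_features)

codes, alpha = ternarize(w)
w_hat = codes * alpha             # dequantized ternary weights

# Weight-space error versus the layer-output error that matters at inference.
weight_rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
output_rel_err = np.linalg.norm(x @ w.T - x @ w_hat.T) / np.linalg.norm(x @ w.T)

print(f"bits per weight      : {bits_per_weight:.3f}")
print(f"relative weight error: {weight_rel_err:.3f}")
print(f"relative output error: {output_rel_err:.3f}")
```

The two error measures generally differ because the output error depends on the activations `x`. That difference is the motivation for minimizing layer-output error rather than weight error alone.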
In practice
- Achieve W1.58A4 compression for LLMs.
- Accelerate LLM inference significantly.
- Use the provided code to implement TWLA.
Topics
- Large Language Models
- Post-Training Quantization
- Ternary Weights
- Low-Bit Activations
- Model Compression
- Inference Acceleration
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.