TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization
Summary
TWLA is a post-training quantization (PTQ) framework that brings large language models (LLMs) to 1.58-bit (ternary) weights and 4-bit activations while maintaining high accuracy. It targets the memory and compute costs that hinder LLM deployment, and it addresses a weakness of existing ternarization methods, which often fail when activation distributions are heavy-tailed. TWLA integrates three core components: the Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ), which minimizes layer-output error; Kronecker Orthogonal Tri-Modal Shaping (KOTMS), which transforms weights into ternary-friendly distributions and suppresses activation outliers; and Inter-Layer Aware Activation Mixed Precision (ILA-AMP), which optimizes bit allocation across layers. In the reported experiments, TWLA delivers a 3.64x speedup over FP16 and reduces LLaMA2-13B's parameter memory from 23.7GB to 3.34GB.
Key takeaway
For machine learning engineers deploying LLMs on edge devices or other resource-limited platforms, TWLA offers a way to cut memory use and speed up inference after training, with no retraining step. Consider this PTQ framework if you need 1.58-bit weights and 4-bit activations: the reported results show LLaMA2-13B parameter memory falling from 23.7GB to 3.34GB (about 86% less) and a 3.64x speedup over FP16. Savings of that size make LLM applications more practical in privacy-sensitive or latency-critical use cases.
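The headline numbers can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the summary (23.7GB and 3.34GB for LLaMA2-13B) plus the fact that a three-valued weight carries log2(3) bits of information; the variable names are illustrative.

```python
import math

# A ternary weight takes one of three values {-1, 0, +1}, so its
# information content is log2(3) bits: the origin of the "1.58-bit" label.
bits_per_ternary_weight = math.log2(3)

# Parameter-memory figures quoted in the summary for LLaMA2-13B.
fp16_gb = 23.7
twla_gb = 3.34

compression_ratio = fp16_gb / twla_gb
memory_reduction = 1.0 - twla_gb / fp16_gb

print(f"bits per ternary weight: {bits_per_ternary_weight:.3f}")  # 1.585
print(f"compression ratio: {compression_ratio:.2f}x")             # 7.10x
print(f"memory reduction: {memory_reduction:.1%}")                # 85.9%
```

The roughly 86% reduction is consistent with the "over 80%" figure above.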
Key insights
TWLA enables efficient LLM inference by jointly quantizing weights to 1.58 bits (ternary values, since log2(3) ≈ 1.58) and activations to 4 bits via a three-module PTQ framework.
Principles
- Ternarization needs tri-modal weight distributions.
- Orthogonal mixing suppresses activation outliers.
- Interactions between adjacent layers affect how many activation bits each layer needs.
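The outlier-suppression principle can be illustrated with a small NumPy experiment. This is a minimal sketch, not the KOTMS procedure: it assumes fixed Hadamard matrices as the two orthogonal Kronecker factors (the summary does not say how TWLA chooses its rotations), a synthetic activation matrix with one heavy outlier channel, and a plain symmetric per-row 4-bit quantizer. The helpers `hadamard` and `quantize_int4` are hypothetical names introduced here.

```python
import numpy as np

def hadamard(n):
    """Normalized Sylvester-Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_int4(x):
    """Symmetric per-row 4-bit quantize-dequantize (integer levels -7..7)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
d1 = d2 = 8
d = d1 * d2
X = rng.normal(size=(16, d))
X[:, 3] += 50.0                      # one heavy outlier channel (synthetic)
W = rng.normal(size=(d, d))

Q1, Q2 = hadamard(d1), hadamard(d2)
Q = np.kron(Q1, Q2)                  # Kronecker product of orthogonal factors

# Q is orthogonal, so rotating activations by Q and weights by Q^T
# leaves the full-precision layer output unchanged.
X_rot, W_rot = X @ Q, Q.T @ W
assert np.allclose(Q @ Q.T, np.eye(d))
assert np.allclose(X @ W, X_rot @ W_rot)

# The Kronecker structure lets the rotation be applied with two small
# matmuls on reshaped rows instead of one d x d matmul.
X_fast = (Q1.T @ X.reshape(-1, d1, d2) @ Q2).reshape(-1, d)
assert np.allclose(X_fast, X_rot)

# The rotation spreads the outlier over all channels, which shrinks the
# quantization step and the 4-bit round-off error.
err_plain = np.linalg.norm(quantize_int4(X) - X)
err_rot = np.linalg.norm(quantize_int4(X_rot) - X_rot)
print(f"max |x| before/after: {np.abs(X).max():.1f} / {np.abs(X_rot).max():.1f}")
print(f"int4 error before/after: {err_plain:.2f} / {err_rot:.2f}")
```

Because the rotation is folded into the weights, the layer computes the same function in full precision; only the quantization error changes.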
Method
TWLA uses E2M-ATQ for weight ternarization, KOTMS for weight shaping and outlier suppression, and ILA-AMP for inter-layer aware mixed-precision activation bit allocation.
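To make the idea of minimizing layer-output error concrete, here is a hedged sketch that is not E2M-ATQ itself. It ternarizes weights with the classic threshold heuristic from Ternary Weight Networks (0.7 times the mean absolute weight), then re-fits the per-column scale so that the layer output, rather than the weight matrix, is matched in the least-squares sense. The function names, shapes, and synthetic data are all assumptions made for illustration.

```python
import numpy as np

def ternarize_columns(W):
    """Ternarize each output column: W[:, j] ~ alpha[j] * T[:, j].

    T has entries in {-1, 0, +1}. The 0.7 * mean(|w|) threshold is the
    Ternary Weight Networks heuristic, used here as a stand-in baseline.
    """
    delta = 0.7 * np.abs(W).mean(axis=0, keepdims=True)
    T = np.sign(W) * (np.abs(W) > delta)
    # Least-squares scale in weight space: <w, t> / <t, t> per column.
    alpha = (W * T).sum(axis=0) / np.maximum((T * T).sum(axis=0), 1.0)
    return alpha, T

def refit_scale_on_outputs(X, W, T):
    """Re-fit alpha per column to minimize ||X w - alpha * X t||^2."""
    Y, Yt = X @ W, X @ T
    return (Y * Yt).sum(axis=0) / np.maximum((Yt * Yt).sum(axis=0), 1e-12)

rng = np.random.default_rng(0)
# Synthetic calibration activations with uneven channel scales.
X = rng.normal(size=(256, 32)) * rng.uniform(0.1, 5.0, size=32)
W = rng.normal(size=(32, 16))

alpha_w, T = ternarize_columns(W)
alpha_y = refit_scale_on_outputs(X, W, T)

# Compare the layer-output error of the two scale choices.
err_weight_fit = np.linalg.norm(X @ W - X @ (T * alpha_w))
err_output_fit = np.linalg.norm(X @ W - X @ (T * alpha_y))
print(f"output error, weight-space scale: {err_weight_fit:.2f}")
print(f"output error, output-space scale: {err_output_fit:.2f}")
```

For a fixed ternary pattern, the output-space scale is the least-squares optimum per column, so its output error can never exceed that of the weight-space scale. How E2M-ATQ handles asymmetry and its two optimization stages is not described in this summary and is not modeled here.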
In practice
- Apply Kronecker-structured orthogonal rotations.
- Use two-stage Euclidean-to-Manifold optimization.
- Model adjacent-layer interaction costs for bit allocation.
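The last item above, bit allocation with adjacent-layer interaction costs, can be sketched as a shortest-path problem over a chain of layers. This is an illustrative formulation under stated assumptions, not the ILA-AMP algorithm: the per-layer sensitivities, the pairwise penalties, the candidate bit-widths, and the per-bit price `lam` are all hypothetical, and `allocate_bits` is a name introduced here.

```python
import numpy as np

def allocate_bits(unary, pairwise, bit_choices, lam):
    """Viterbi-style dynamic program over a chain of layers.

    unary[i, a]       : hypothetical error of layer i at bit_choices[a]
    pairwise[i, a, b] : hypothetical interaction cost between layer i at
                        bit_choices[a] and layer i + 1 at bit_choices[b]
    lam               : price per bit, trading accuracy against bit-width
    """
    n_layers, n_choices = unary.shape
    bit_price = lam * np.asarray(bit_choices, dtype=float)
    best = unary[0] + bit_price          # best cost of a path ending in each choice
    back = np.zeros((n_layers, n_choices), dtype=int)
    for i in range(1, n_layers):
        # total[a, b]: best path ending at layer i-1 with a, then layer i with b
        total = best[:, None] + pairwise[i - 1] + (unary[i] + bit_price)[None, :]
        back[i] = total.argmin(axis=0)
        best = total.min(axis=0)
    # Backtrack the optimal assignment from the last layer to the first.
    idx = [int(best.argmin())]
    for i in range(n_layers - 1, 0, -1):
        idx.append(int(back[i, idx[-1]]))
    idx.reverse()
    return [bit_choices[a] for a in idx], float(best.min())

bit_choices = (4, 8)
# Made-up sensitivities: columns are the cost at 4 bits and at 8 bits.
unary = np.array([[0.3, 0.1], [0.2, 0.1], [3.0, 0.2], [0.3, 0.1]])
pairwise = np.zeros((3, 2, 2))
pairwise[:, 0, 0] = 1.0   # extra error when two adjacent layers are both 4-bit

bits, cost = allocate_bits(unary, pairwise, bit_choices, lam=0.2)
print(bits, round(cost, 2))
```

With these made-up costs the chain optimum is [8, 4, 8, 4] at a total cost of 5.6. Choosing each layer independently would give [4, 4, 8, 4], which pays the adjacent-layer penalty and ends up at 6.0, so modeling the interaction changes the allocation.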
Topics
- Large Language Models
- Post-Training Quantization
- Ternary Quantization
- Low-Bit Activations
- Model Compression
- Inference Optimization
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.