TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization
Summary
TWLA is a post-training quantization (PTQ) framework designed to reduce the memory and compute costs of large language models (LLMs). Published on 2026-06-11, it targets ternary (three-level) weights, which carry log2(3) ≈ 1.58 bits each, together with 4-bit activations, a setting written W1.58A4. It addresses a limitation of existing ternarization methods, which struggle with heavy-tailed activation distributions. The framework integrates three components: the Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ), which minimizes layer-output error; Kronecker Orthogonal Tri-Modal Shaping (KOTMS), which reshapes weights and suppresses activation outliers; and Inter-Layer Aware Activation Mixed Precision (ILA-AMP), which allocates activation bits while accounting for adjacent-layer interactions. The reported experiments indicate that TWLA maintains high accuracy under W1.58A4 while delivering substantial inference acceleration.
Key takeaway
For machine learning engineers deploying LLMs under tight memory and compute constraints, TWLA is worth evaluating. It targets W1.58A4 quantization, which shrinks model size and speeds up inference while, according to the reported experiments, maintaining high accuracy. Consider its techniques when activation outliers are what blocks low-bit quantization in your deployment.
Key insights
TWLA enables aggressive LLM quantization (W1.58A4) by addressing activation distribution challenges and inter-layer effects.
Principles
- Ternarization significantly compresses LLMs.
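- Heavy-tailed activations make low-bit activation quantization hard, which limits end-to-end acceleration.
- Accounting for inter-layer interaction costs during bit allocation helps keep quantization errors from cascading across layers.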
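To make the heavy-tail principle concrete, here is a minimal NumPy sketch. It is not KOTMS, whose details this summary does not give; it uses a generic stand-in, a normalized Sylvester Hadamard matrix built by repeated Kronecker products, as the orthogonal transform. The outlier channel, the tensor sizes, and the helper names (`quantize_sym`, `sylvester_hadamard`, `peak_to_rms`) are assumptions chosen for illustration.

```python
import numpy as np

def quantize_sym(x, bits=4):
    """Symmetric per-tensor uniform quantization (round to nearest)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def sylvester_hadamard(k):
    """Orthonormal 2**k x 2**k Hadamard matrix via repeated Kronecker products."""
    h2 = np.array([[1.0, 1.0], [1.0, -1.0]])
    h = np.array([[1.0]])
    for _ in range(k):
        h = np.kron(h, h2)
    return h / np.sqrt(h.shape[0])

def peak_to_rms(a):
    """Ratio of the largest magnitude to the RMS value (a crude tail measure)."""
    return np.abs(a).max() / np.sqrt((a ** 2).mean())

def rel_err(ref, approx):
    return np.linalg.norm(ref - approx) / np.linalg.norm(ref)

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 256))   # hypothetical activations (tokens x channels)
x[:, 3] *= 50.0                       # assumed outlier channel creating a heavy tail
w = rng.standard_normal((256, 64))    # hypothetical weight matrix

Q = sylvester_hadamard(8)             # 256 x 256, Q @ Q.T == I
x_rot = x @ Q                         # rotated activations
w_rot = Q.T @ w                       # counter-rotated weights

# The rotation leaves the full-precision layer output unchanged.
same_output = np.allclose(x @ w, x_rot @ w_rot)

# 4-bit quantization error before and after the rotation.
err_plain = rel_err(x, quantize_sym(x))
err_rot = rel_err(x_rot, quantize_sym(x_rot))

print("peak/RMS plain vs rotated:", peak_to_rms(x), peak_to_rms(x_rot))
print("4-bit relative error plain vs rotated:", err_plain, err_rot)
```

In this toy setup the rotation spreads the outlier channel's energy across all channels, so the 4-bit grid no longer has to stretch to cover a single extreme value. Because the counter-rotation is folded into the weights, the full-precision output is preserved.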
Method
TWLA employs E2M-ATQ for weight ternarization, KOTMS for reshaping weights and suppressing outliers, and ILA-AMP for mixed-precision bit allocation considering adjacent-layer interactions.
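As a rough illustration of weight ternarization judged by layer-output error, the sketch below implements a generic asymmetric ternary baseline. It is not E2M-ATQ. The threshold rule (0.7 times the mean absolute weight, borrowed from earlier ternary-weight work), the per-row positive and negative scales, and every name and shape here are assumptions for illustration only.

```python
import numpy as np

def ternarize_asymmetric(w, thresh_factor=0.7):
    """Map each row of w to three levels {-a_neg, 0, +a_pos}.

    Returns integer codes in {-1, 0, 1} and the dequantized weights.
    """
    delta = thresh_factor * np.abs(w).mean(axis=1, keepdims=True)
    pos = w > delta
    neg = w < -delta
    # Per-row scale = mean magnitude of each group (guard against empty groups).
    a_pos = (w * pos).sum(axis=1, keepdims=True) / np.maximum(pos.sum(axis=1, keepdims=True), 1)
    a_neg = (-w * neg).sum(axis=1, keepdims=True) / np.maximum(neg.sum(axis=1, keepdims=True), 1)
    codes = pos.astype(np.int8) - neg.astype(np.int8)
    w_hat = a_pos * pos - a_neg * neg
    return codes, w_hat

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 256)) + 0.1   # hypothetical (out, in) weights, slightly shifted
x = rng.standard_normal((128, 256))        # hypothetical calibration activations

codes, w_hat = ternarize_asymmetric(w)

# Layer-output error: compare x @ W^T with x @ W_hat^T rather than W with W_hat.
y_ref = x @ w.T
y_q = x @ w_hat.T
out_err = np.linalg.norm(y_ref - y_q) / np.linalg.norm(y_ref)
print("relative layer-output error:", out_err)
```

Measuring the error on the layer output instead of on the weights is the part that echoes the summary's description of E2M-ATQ; how the paper actually minimizes that error is not specified here.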
In practice
- Achieve 1.58-bit weight and 4-bit activation quantization.
- Reduce LLM memory and compute costs.
- Enable significant inference acceleration.
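The "1.58-bit" figure and the memory savings can be checked with a little arithmetic. The sketch below is a back-of-the-envelope calculation, not a measurement from the paper: the 7-billion-parameter model size is hypothetical, the five-trits-per-byte packing is one common scheme rather than TWLA's storage format, and the estimate ignores scales, embeddings, and the KV cache.

```python
import math
import numpy as np

BITS_PER_TRIT = math.log2(3)          # information content of one ternary weight

def pack_trits(codes):
    """Pack ternary codes in {-1, 0, 1} five per byte (3**5 = 243 fits in 256)."""
    digits = np.asarray(codes, dtype=np.int64) + 1       # shift to {0, 1, 2}
    digits = np.pad(digits, (0, (-len(digits)) % 5))     # pad to a multiple of 5
    powers = 3 ** np.arange(5)
    return (digits.reshape(-1, 5) * powers).sum(axis=1).astype(np.uint8)

def unpack_trits(packed, n):
    """Inverse of pack_trits; n is the original number of codes."""
    digits = packed.astype(np.int64)[:, None] // (3 ** np.arange(5)) % 3
    return (digits.reshape(-1)[:n] - 1).astype(np.int8)

# Round-trip check on a small random ternary vector.
rng = np.random.default_rng(0)
codes = rng.integers(-1, 2, size=23)
packed = pack_trits(codes)
restored = unpack_trits(packed, len(codes))

# Weight-memory estimate for a hypothetical 7B-parameter model.
n_params = 7_000_000_000
fp16_gb = n_params * 16 / 8 / 1e9              # 16 bits per weight
ternary_ideal_gb = n_params * BITS_PER_TRIT / 8 / 1e9
ternary_packed_gb = n_params / 5 / 1e9         # 5 weights per byte = 1.6 bits each

print(f"bits per ternary weight: {BITS_PER_TRIT:.3f}")
print(f"FP16: {fp16_gb:.1f} GB, ideal ternary: {ternary_ideal_gb:.2f} GB, "
      f"packed ternary: {ternary_packed_gb:.2f} GB")
```

Packing five ternary values into one byte costs 1.6 bits per weight, close to the 1.58-bit information-theoretic floor, which is why ternary weight storage is roughly a tenth the size of FP16.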
Topics
- Large Language Models
- Post-Training Quantization
- Ternary Quantization
- Model Compression
- Low-Bit Activation
- Inference Acceleration
Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.