TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

TWLA is a post-training quantization (PTQ) framework that compresses large language model (LLM) weights to 1.58 bits (ternary values, since log2 3 ≈ 1.58) and quantizes activations to 4 bits while maintaining high accuracy. It targets the memory and compute costs that hinder LLM deployment, and it addresses a limitation of existing ternarization methods, which often fail when activation distributions are heavy-tailed. TWLA integrates three core components: the Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ), which minimizes layer-output error; Kronecker Orthogonal Tri-Modal Shaping (KOTMS), which transforms weights into ternary-friendly distributions and suppresses activation outliers; and Inter-Layer Aware Activation Mixed Precision (ILA-AMP), which optimizes bit allocation across layers. The reported results include a 3.64x speedup over FP16 and a reduction of LLaMA2-13B's parameter memory from 23.7GB to 3.34GB.

Key takeaway

For machine learning engineers deploying LLMs on edge devices or resource-limited platforms, TWLA offers a way to reduce memory use and improve inference speed without retraining. Consider evaluating this PTQ framework to reach 1.58-bit weights and 4-bit activations: the reported figures correspond to cutting LLaMA2-13B parameter memory by roughly 86% (23.7GB to 3.34GB) and a 3.64x speedup over FP16. This enables broader access to LLM applications in privacy-sensitive or latency-critical use cases.
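The headline numbers can be checked with a few lines of arithmetic. The sketch below uses only figures quoted in the summary (23.7GB and 3.34GB for LLaMA2-13B) plus the fact that a ternary value carries log2 3 bits of information; the variable names are illustrative and not taken from the paper.

```python
import math

# Information content of one ternary weight in {-1, 0, +1}.
bits_per_ternary_weight = math.log2(3)

# Parameter-memory figures for LLaMA2-13B as quoted in the summary.
fp16_gb = 23.7
twla_gb = 3.34

compression_ratio = fp16_gb / twla_gb    # how many times smaller
memory_saved = 1.0 - twla_gb / fp16_gb   # fraction of memory removed

# Ideal ratio if every FP16 value became a perfectly packed ternary value.
ideal_ratio = 16 / bits_per_ternary_weight

print(f"bits per ternary weight: {bits_per_ternary_weight:.3f}")
print(f"reported compression:    {compression_ratio:.2f}x")
print(f"memory saved:            {memory_saved:.1%}")
print(f"ideal ternary ratio:     {ideal_ratio:.2f}x")
```

The reported ratio comes out below the ideal packed-ternary ratio. That would be consistent with some tensors or quantization metadata remaining at higher precision, although the summary does not state where the difference comes from.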

Key insights

TWLA enables efficient LLM inference by jointly quantizing weights to 1.58-bit and activations to 4-bit via a three-module PTQ framework.
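To make "ternary weights plus 4-bit activations" concrete, here is a minimal sketch of a generic baseline: mean-absolute-value scaling for the weights and symmetric max-abs scaling for the activations. This is not TWLA's E2M-ATQ or ILA-AMP, whose details are not given in this summary. The function names `ternarize`, `quantize_int4`, and `quantized_dot` are hypothetical, and plain Python lists are used to keep the example dependency-free.

```python
def ternarize(weights):
    """Map floats to codes in {-1, 0, +1} plus one scale.

    Assumption: the scale is the mean absolute weight, a common
    heuristic, NOT the quantizer described for TWLA.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def quantize_int4(activations):
    """Symmetric 4-bit quantization: integer codes in [-8, 7] plus one scale."""
    max_abs = max(abs(a) for a in activations)
    if max_abs == 0.0:
        return [0] * len(activations), 0.0
    scale = max_abs / 7.0
    codes = [max(-8, min(7, round(a / scale))) for a in activations]
    return codes, scale


def quantized_dot(w_codes, w_scale, a_codes, a_scale):
    """Accumulate in integers, then rescale once at the end."""
    acc = sum(w * a for w, a in zip(w_codes, a_codes))
    return acc * w_scale * a_scale


# Toy data, made up for illustration only.
weights = [0.9, -0.05, -1.1, 0.4, 0.02, -0.7]
activations = [0.5, -1.2, 0.3, 2.0, -0.1, 0.8]

w_codes, w_scale = ternarize(weights)
a_codes, a_scale = quantize_int4(activations)

approx = quantized_dot(w_codes, w_scale, a_codes, a_scale)
exact = sum(w * a for w, a in zip(weights, activations))
print(w_codes, a_codes)
print(f"exact={exact:.3f} approx={approx:.3f}")
```

Because the weight codes are only -1, 0, or +1, the inner loop needs additions and subtractions rather than multiplications, which is the usual source of ternary speedups. On this toy vector the approximation error is large, which illustrates why naive ternarization needs the kind of error-aware machinery the framework adds.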

Principles

Method

TWLA uses E2M-ATQ for weight ternarization, KOTMS for weight shaping and outlier suppression, and ILA-AMP for inter-layer aware mixed-precision activation bit allocation.
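The outlier-suppression idea can be illustrated with a generic orthogonal rotation built as a Kronecker product. The sketch below is an assumption-laden illustration, not KOTMS itself: it uses a fixed normalized Hadamard factor, whereas the summary does not say how TWLA constructs its transform. It shows two properties that rotation-based methods rely on: rotating both the activation and the weight row by the same orthogonal matrix leaves their dot product unchanged, and the rotation spreads a single large activation across coordinates, shrinking the peak magnitude that a 4-bit quantizer must cover. The helper names `kron`, `matvec`, and `dot` are hypothetical.

```python
import math


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


# 2x2 normalized Hadamard matrix (orthogonal). The Kronecker product
# of orthogonal matrices is again orthogonal, so Q is 4x4 orthogonal.
s = 1 / math.sqrt(2)
H2 = [[s, s], [s, -s]]
Q = kron(H2, H2)

# Toy data, made up for illustration: one heavy activation outlier.
activation = [0.1, -0.2, 8.0, 0.3]
weight_row = [0.5, -0.4, 0.2, 0.7]

rot_act = matvec(Q, activation)
rot_w = matvec(Q, weight_row)

peak_before = max(abs(a) for a in activation)
peak_after = max(abs(a) for a in rot_act)

print(f"dot before: {dot(activation, weight_row):.4f}")
print(f"dot after:  {dot(rot_act, rot_w):.4f}")
print(f"peak |activation| before: {peak_before:.2f}, after: {peak_after:.2f}")
```

A Kronecker-structured transform is attractive because a large orthogonal matrix can be represented and applied through its small factors rather than stored densely.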

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.