TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

TWLA is a post-training quantization (PTQ) framework that compresses large language model (LLM) weights to ternary values (about 1.58 bits each) and quantizes activations to 4 bits while maintaining high accuracy. It targets the memory and compute costs that hinder LLM deployment, and it addresses a weakness of existing ternarization methods, which often fail when activation distributions are heavy-tailed. TWLA combines three components: the Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ), which minimizes layer-output error; Kronecker Orthogonal Tri-Modal Shaping (KOTMS), which transforms weights into ternary-friendly distributions and suppresses activation outliers; and Inter-Layer Aware Activation Mixed Precision (ILA-AMP), which optimizes activation bit allocation across layers. In the reported experiments, TWLA delivers a 3.64x speedup over FP16 and reduces LLaMA2-13B's parameter memory from 23.7GB to 3.34GB.
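
The headline numbers can be sanity-checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this summary; the "effective bits" estimate assumes the FP16 figure corresponds to 16 bits per parameter across the whole model, which the summary does not state.

```python
import math

# Information content of one ternary weight in {-1, 0, +1}.
bits_per_ternary_weight = math.log2(3)  # usually quoted as "1.58-bit"

# Figures quoted in the summary for LLaMA2-13B parameter memory.
fp16_gb = 23.7
twla_gb = 3.34

compression_ratio = fp16_gb / twla_gb      # how many times smaller
memory_saved = 1.0 - twla_gb / fp16_gb     # fraction of memory removed

# Effective bits per parameter implied by the two sizes
# (assumption: the 23.7GB baseline stores every parameter in 16 bits).
effective_bits = 16.0 * twla_gb / fp16_gb

print(f"{bits_per_ternary_weight:.3f} bits per ternary weight")
print(f"{compression_ratio:.2f}x smaller, {memory_saved:.1%} saved")
print(f"{effective_bits:.2f} effective bits per parameter")
```

The implied effective bit-width comes out above 1.58. The summary does not say what accounts for the gap; per-group scales or tensors kept at higher precision are plausible candidates, but that is a guess.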

Key takeaway

For machine learning engineers deploying LLMs on edge devices or other resource-limited platforms, TWLA offers a way to cut memory use and speed up inference without retraining. Consider evaluating this PTQ framework if 1.58-bit weights and 4-bit activations fit your accuracy budget: the reported results show LLaMA2-13B's parameter memory falling from 23.7GB to 3.34GB (a reduction of roughly 86%) and a 3.64x speedup over FP16. Savings of that size make LLM applications more practical in privacy-sensitive or latency-critical settings.

Key insights

TWLA enables efficient LLM inference by jointly quantizing weights to 1.58-bit and activations to 4-bit via a three-module PTQ framework.

Principles

Ternarization needs tri-modal weight distributions.
Orthogonal mixing suppresses activation outliers.
Cross-layer interactions affect the best per-layer bit allocation.
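
The second principle can be illustrated with a small NumPy sketch. It is not the KOTMS procedure itself: it uses fixed normalized Hadamard matrices as the orthogonal factors (an assumption made here for simplicity), and the names `hadamard` and `kron_rotate` are hypothetical helpers. It shows that rotating activations by a Kronecker-structured orthogonal matrix spreads an outlier channel across all channels, and that folding the inverse rotation into the weights leaves the layer output unchanged.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Normalized Hadamard matrix (orthogonal); n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def kron_rotate(X: np.ndarray, Q1: np.ndarray, Q2: np.ndarray) -> np.ndarray:
    """Compute X @ kron(Q1, Q2) without materializing the Kronecker product."""
    n1, n2 = Q1.shape[0], Q2.shape[0]
    Xr = X.reshape(-1, n1, n2)
    # out[b, k, l] = sum over i, j of Xr[b, i, j] * Q1[i, k] * Q2[j, l]
    return np.einsum("bij,ik,jl->bkl", Xr, Q1, Q2).reshape(X.shape)

rng = np.random.default_rng(0)
d_in, d_out = 64, 32
X = rng.normal(size=(16, d_in))
X[:, 5] *= 50.0                  # one heavy-tailed "outlier" channel
W = rng.normal(size=(d_in, d_out))

Q1, Q2 = hadamard(8), hadamard(8)
Q = np.kron(Q1, Q2)              # dense version, built only for verification

X_rot = kron_rotate(X, Q1, Q2)
W_rot = Q.T @ W                  # fold the inverse rotation into the weights

assert np.allclose(X_rot, X @ Q)           # fast path matches the dense product
assert np.allclose(X_rot @ W_rot, X @ W)   # layer output is unchanged

print("max |x| before rotation:", np.abs(X).max())
print("max |x| after rotation: ", np.abs(X_rot).max())
```

The structured form matters for cost: applying two 8x8 factors takes far fewer multiplications than one dense 64x64 rotation, and the gap widens with the hidden size.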

Method

TWLA uses E2M-ATQ for weight ternarization, KOTMS for weight shaping and outlier suppression, and ILA-AMP for inter-layer aware mixed-precision activation bit allocation.
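
To make the target number formats concrete, here is a minimal sketch of a layer with ternary weights and 4-bit activations. It deliberately swaps in simple baselines rather than TWLA's modules: a symmetric threshold-based ternarizer in the style of Ternary Weight Networks (not E2M-ATQ, which is asymmetric and optimizes layer-output error) and plain round-to-nearest per-token activation quantization. The function names and the `delta_factor` default are assumptions for illustration.

```python
import numpy as np

def ternarize(W: np.ndarray, delta_factor: float = 0.7):
    """Per-output-column ternarization: W is approximated by alpha * T,
    with T in {-1, 0, +1}."""
    absW = np.abs(W)
    delta = delta_factor * absW.mean(axis=0, keepdims=True)  # zeroing threshold
    T = np.sign(W) * (absW > delta)
    # Least-squares optimal scale for a fixed T: mean |w| over the kept entries.
    kept = np.maximum(np.abs(T).sum(axis=0, keepdims=True), 1)
    alpha = (absW * np.abs(T)).sum(axis=0, keepdims=True) / kept
    return T.astype(np.int8), alpha

def quantize_activations(X: np.ndarray, bits: int = 4):
    """Symmetric per-token round-to-nearest quantization to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for 4-bit
    scale = np.abs(X).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)       # guard all-zero rows
    Xq = np.clip(np.round(X / scale), -qmax - 1, qmax).astype(np.int8)
    return Xq, scale

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 64))
W = rng.normal(size=(64, 32))

T, alpha = ternarize(W)
Xq, s = quantize_activations(X, bits=4)

# Integer matmul, then rescale back to floating point.
Y_hat = (Xq.astype(np.int32) @ T.astype(np.int32)) * s * alpha
Y = X @ W
rel_err = np.linalg.norm(Y_hat - Y) / np.linalg.norm(Y)
print(f"relative output error of the naive baseline: {rel_err:.3f}")
```

The error of this naive baseline on random Gaussian data is substantial, which is the gap that weight shaping, output-aware ternarization, and mixed-precision allocation are meant to close.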

In practice

Apply Kronecker-structured orthogonal rotations.
Use two-stage Euclidean-to-Manifold optimization.
Model adjacent-layer interaction costs for bit allocation.
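
The last item, modelling adjacent-layer interaction costs, can be sketched as a chain-structured optimization. The summary does not describe ILA-AMP's actual objective or solver, so everything below is an assumption: the cost tables are made up, and `allocate_bits` is a hypothetical helper that uses dynamic programming to pick one activation bit-width per layer under a total bit budget.

```python
import numpy as np

def allocate_bits(unary, pairwise, choices, budget):
    """Choose one bit-width per layer to minimize
        sum_l unary[l, b_l] + sum_l pairwise[l, b_l, b_{l+1}]
    subject to sum_l choices[b_l] <= budget (budget must be feasible)."""
    L, K = unary.shape
    inf = float("inf")
    # State: (choice at current layer, bits used so far) -> (cost, path).
    best = {(k, choices[k]): (unary[0, k], [k])
            for k in range(K) if choices[k] <= budget}
    for l in range(1, L):
        nxt = {}
        for (kp, used), (cost, path) in best.items():
            for k in range(K):
                u = used + choices[k]
                if u > budget:
                    continue
                c = cost + pairwise[l - 1, kp, k] + unary[l, k]
                if c < nxt.get((k, u), (inf, None))[0]:
                    nxt[(k, u)] = (c, path + [k])
        best = nxt
    cost, path = min(best.values(), key=lambda t: t[0])
    return [choices[k] for k in path], cost

rng = np.random.default_rng(0)
choices = [4, 8]                       # candidate activation bit-widths
L = 6
sens = rng.uniform(0.5, 2.0, size=L)   # hypothetical per-layer sensitivity

# Hypothetical costs: 4-bit layers cost more than 8-bit layers, and two
# adjacent 4-bit layers incur an extra interaction penalty.
unary = np.stack([sens * 1.0, sens * 0.1], axis=1)   # columns: 4-bit, 8-bit
pairwise = np.zeros((L - 1, 2, 2))
pairwise[:, 0, 0] = 0.5

bits, cost = allocate_bits(unary, pairwise, choices, budget=32)
print("bit-widths per layer:", bits)
print("total bits:", sum(bits), "cost:", round(float(cost), 3))
```

With a budget of 32 bits over six layers, at most two layers can be promoted to 8 bits. The pairwise term means the best choice is not always the two most sensitive layers in isolation, since promoting a layer also removes the penalty it shares with its neighbours.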

Topics

Large Language Models
Post-Training Quantization
Ternary Quantization
Low-Bit Activations
Model Compression
Inference Optimization

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.