TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

2026-06-11 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

TWLA is a post-training quantization (PTQ) framework, published on 2026-06-11, that aims to cut the memory and compute costs of large language models (LLMs). It targets the W1.58A4 setting: ternary weights, where each weight takes one of three values and so carries log2(3) ≈ 1.58 bits, combined with 4-bit activations. Existing ternarization methods struggle here because heavy-tailed activation distributions make low-bit activation quantization difficult. The framework integrates three components: the Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ), which minimizes layer-output error; Kronecker Orthogonal Tri-Modal Shaping (KOTMS), which reshapes weights and suppresses activation outliers; and Inter-Layer Aware Activation Mixed Precision (ILA-AMP), which allocates activation bits while accounting for adjacent-layer interactions. The authors report that TWLA maintains high accuracy under W1.58A4 while delivering substantial inference acceleration.

Key takeaway

For machine learning engineers deploying LLMs under tight memory and compute budgets, TWLA is a post-training route to W1.58A4 quantization. According to the reported experiments, it reduces model size and accelerates inference while maintaining high accuracy. Consider evaluating its techniques if activation quantization is the bottleneck in your deployment.

Key insights

TWLA enables aggressive LLM quantization (W1.58A4) by suppressing heavy-tailed activation outliers and by accounting for adjacent-layer interactions when allocating activation bits.

Principles

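Ternarization compresses LLM weights to roughly 1.58 bits each.
Heavy-tailed activation distributions make low-bit activation quantization hard, which limits end-to-end acceleration.
Accounting for inter-layer interactions keeps quantization error from cascading across adjacent layers.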
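
The heavy-tail principle above can be illustrated with a small numerical sketch. It is not KOTMS: it swaps in a plain normalized Hadamard rotation, a well-known orthogonal transform, to show why spreading an outlier across channels can help a 4-bit quantizer. The activation vector and all function names are hypothetical and chosen for illustration only.

```python
import math

def quantize_sym(x, bits=4):
    """Symmetric per-tensor quantization: snap to a signed integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(v) for v in x) / qmax           # one outlier inflates this scale
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in x]

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def hadamard(n):
    """Normalized Sylvester Hadamard matrix (n must be a power of two)."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    s = 1 / math.sqrt(n)
    return [[v * s for v in row] for row in H]

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

# Hypothetical activations: three moderate values and one large outlier.
x = [1.2, -0.9, 1.1, 16.0]

# Direct 4-bit quantization: the outlier sets the scale, small values lose resolution.
err_plain = mse(x, quantize_sym(x))

# Rotate, quantize, rotate back. This H is symmetric and orthogonal, so H @ H = I.
H = hadamard(4)
x_rot = matvec(H, x)
err_rot = mse(x, matvec(H, quantize_sym(x_rot)))

print(f"4-bit MSE without rotation: {err_plain:.4f}")
print(f"4-bit MSE with rotation:    {err_rot:.4f}")
```

For this input the rotated vector has entries of similar magnitude, so the 4-bit grid is used more evenly and the reconstruction error drops. That behaviour is specific to the example and is not a guarantee for arbitrary activations.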

Method

TWLA employs E2M-ATQ for weight ternarization, KOTMS for reshaping weights and suppressing outliers, and ILA-AMP for mixed-precision bit allocation considering adjacent-layer interactions.
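
To make "weight ternarization" and "layer-output error" concrete, here is a minimal sketch of a generic ternary quantizer. It is not E2M-ATQ: it uses the threshold heuristic from Ternary Weight Networks (threshold at 0.7 times the mean absolute weight), is symmetric rather than asymmetric, and only measures the output error instead of minimizing it. The weights, inputs, and function names are hypothetical.

```python
def ternarize(w):
    """Map weights to alpha * {-1, 0, +1}.

    Generic threshold heuristic (assumption, not the E2M-ATQ quantizer):
    weights with |w| <= delta become 0, the rest keep only their sign.
    """
    delta = 0.7 * sum(abs(v) for v in w) / len(w)
    codes = [0 if abs(v) <= delta else (1 if v > 0 else -1) for v in w]
    kept = [abs(v) for v, c in zip(w, codes) if c != 0]
    # For fixed codes, the least-squares scale is the mean |w| of the kept weights.
    alpha = sum(kept) / len(kept) if kept else 0.0
    return codes, alpha

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Hypothetical weight row and input activations for one output neuron.
w = [0.9, -0.05, 0.4, -0.8, 0.02, -0.6]
x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0]

codes, alpha = ternarize(w)
y_ref = dot(w, x)                 # full-precision output
y_q = alpha * dot(codes, x)       # ternary output: sign flips, skips, one multiply

print("codes:", codes, "alpha:", alpha)
print("layer-output error:", abs(y_ref - y_q))
```

The last line prints the quantity that, per the summary, E2M-ATQ is designed to minimize: the discrepancy at the layer output rather than in the weights themselves.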

In practice

Achieve 1.58-bit weight and 4-bit activation quantization.
Reduce LLM memory and compute costs.
Enable significant inference acceleration.
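
The 1.58-bit figure and its effect on memory can be checked with a short calculation. The sketch below assumes one possible storage layout, packing five ternary codes into a byte (1.6 bits per weight), and a hypothetical 7-billion-parameter model. Neither is taken from the article, and the estimate ignores scales, embeddings, activations, and the KV cache.

```python
import math

# Information content of one value drawn from {-1, 0, +1}.
BITS_PER_TERNARY = math.log2(3)

def pack5(trits):
    """Pack five ternary codes into one byte as a base-3 number (3**5 = 243 <= 256)."""
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    b = 0
    for t in reversed(trits):        # first trit ends up least significant
        b = b * 3 + (t + 1)
    return b

def unpack5(b):
    """Inverse of pack5."""
    out = []
    for _ in range(5):
        out.append(b % 3 - 1)
        b //= 3
    return out

def weight_gib(n_params, bits_per_weight):
    """Weight storage in GiB for a given parameter count and bit width."""
    return n_params * bits_per_weight / 8 / 2**30

N_PARAMS = 7e9                       # hypothetical model size, for illustration only
fp16_gib = weight_gib(N_PARAMS, 16)
packed_gib = weight_gib(N_PARAMS, 8 / 5)   # 5 trits per byte = 1.6 bits per weight

print(f"bits per ternary weight: {BITS_PER_TERNARY:.3f}")
print(f"FP16 weights:   {fp16_gib:.2f} GiB")
print(f"packed ternary: {packed_gib:.2f} GiB")
```

Under these assumptions the packed ternary weights take one tenth of the FP16 footprint, slightly above the theoretical log2(3) bound because 243 of the 256 byte values are used.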

Topics

Large Language Models
Post-Training Quantization
Ternary Quantization
Model Compression
Low-Bit Activation
Inference Acceleration

Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.