TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

2026-06-11 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

TWLA is a post-training quantization (PTQ) framework that targets the memory and compute costs of Large Language Models (LLMs). Existing methods struggle with heavy-tailed activation distributions and therefore keep activations at high precision. TWLA instead combines ternary weights (three levels per weight, log2(3) ≈ 1.58 bits) with 4-bit activations, a setting written W1.58A4, while preserving high accuracy. The framework integrates three components: the Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ), which minimizes layer-output error during weight ternarization; Kronecker Orthogonal Tri-Modal Shaping (KOTMS), which reshapes weights into ternary-friendly distributions and suppresses activation outliers; and Inter-Layer Aware Activation Mixed Precision (ILA-AMP), which allocates activation bit widths while accounting for adjacent-layer interactions. The authors report that TWLA maintains high accuracy under W1.58A4 and delivers substantial inference acceleration.

Key takeaway

Machine learning engineers deploying Large Language Models under tight memory and compute budgets should evaluate TWLA. This post-training quantization framework targets W1.58A4 (ternary weights, 4-bit activations); it is reported to reduce model size and accelerate inference while preserving high accuracy, even with heavy-tailed activation distributions. Consider it where model size or inference cost currently limits where your LLM applications can be deployed.

Key insights

TWLA enables 1.58-bit weights and 4-bit activations for LLMs, overcoming heavy-tailed activation challenges for end-to-end acceleration.
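
Why heavy-tailed activations make 4-bit quantization hard, and why an orthogonal transform helps, can be illustrated with a small NumPy sketch. It is not KOTMS: it uses a fixed Kronecker product of Hadamard matrices rather than anything learned, and the shapes, the 50x outlier channel, and the helper names (hadamard, fake_quant_int4, rel_err) are assumptions chosen for illustration. It shows only the general idea that rotating activations and weights by the same orthogonal matrix leaves the layer output unchanged while spreading an outlier channel across all dimensions.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), H)
    return H / np.sqrt(n)

def fake_quant_int4(X: np.ndarray) -> np.ndarray:
    """Symmetric per-row (per-token) 4-bit quantize-dequantize."""
    scale = np.maximum(np.abs(X).max(axis=1, keepdims=True) / 7.0, 1e-12)
    return np.clip(np.rint(X / scale), -8, 7) * scale

def rel_err(Y_hat: np.ndarray, Y: np.ndarray) -> float:
    """Relative Frobenius error of an approximate layer output."""
    return float(np.linalg.norm(Y_hat - Y) / np.linalg.norm(Y))

rng = np.random.default_rng(0)
d1, d2 = 8, 16                        # hidden size 128 = 8 * 16 (arbitrary)
Q = np.kron(hadamard(d1), hadamard(d2))  # Kronecker-structured orthogonal matrix

X = rng.normal(size=(32, d1 * d2))    # synthetic activations (tokens x hidden)
X[:, 5] *= 50.0                       # inject one outlier channel
W = rng.normal(size=(64, d1 * d2))    # synthetic weights (out x hidden)

Y = X @ W.T                           # full-precision layer output
X_rot, W_rot = X @ Q, W @ Q           # rotate both sides by the same Q
Y_rot = X_rot @ W_rot.T               # X Q Q^T W^T, mathematically equal to Y

err_plain = rel_err(fake_quant_int4(X) @ W.T, Y)
err_rot = rel_err(fake_quant_int4(X_rot) @ W_rot.T, Y)
print(f"A4 error without rotation: {err_plain:.3f}")
print(f"A4 error with rotation:    {err_rot:.3f}")
```

In this toy setting the outlier channel dominates each token's quantization scale, so the remaining channels lose most of their resolution; after rotation the outlier's energy is shared across all 128 dimensions and the 4-bit grid is used more evenly.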

Principles

Minimize layer-output error via two-stage optimization.
Reshape weights into ternary-friendly distributions.
Optimize bit allocation considering adjacent-layer interactions.
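
To make the first principle concrete, here is a minimal sketch of weight ternarization and of the layer-output error it induces. It is not E2M-ATQ: it swaps in a plain absmean-style rule (scale by the mean absolute weight, round, clip to {-1, 0, +1}) applied per output row, with no asymmetry and no optimization against calibration data. The function names and the synthetic Gaussian data are assumptions for illustration.

```python
import numpy as np

def ternarize_absmean(W: np.ndarray, eps: float = 1e-8):
    """Baseline ternarization: W is approximated by scale * T, T in {-1, 0, +1}.

    One scale per output row, set to that row's mean absolute weight.
    """
    scale = np.abs(W).mean(axis=1, keepdims=True) + eps
    T = np.clip(np.rint(W / scale), -1, 1).astype(np.int8)
    return T, scale

def layer_output_error(W: np.ndarray, W_hat: np.ndarray, X: np.ndarray) -> float:
    """Relative Frobenius error of the layer output X @ W.T on inputs X."""
    Y = X @ W.T
    Y_hat = X @ W_hat.T
    return float(np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y))

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))   # synthetic weights (out x in)
X = rng.normal(size=(32, 128))   # synthetic calibration inputs

T, scale = ternarize_absmean(W)
W_hat = scale * T                # dequantized ternary weights
err = layer_output_error(W, W_hat, X)
print("ternary levels used:", np.unique(T).tolist())
print(f"relative layer-output error: {err:.3f}")
```

The quantity returned by layer_output_error is the kind of objective the first principle refers to: a quantizer that minimizes it is judged on what the layer produces for real inputs, not on how closely the ternary weights match the original ones elementwise.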

Method

TWLA employs E2M-ATQ for weight ternarization, KOTMS for weight reshaping and outlier suppression, and ILA-AMP for inter-layer aware activation mixed precision bit allocation.
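
For orientation, activation mixed precision can be framed as a budgeted allocation problem. The sketch below is a generic greedy baseline, not ILA-AMP: it scores each layer independently, which is precisely the simplification that an inter-layer aware method is described as moving beyond. The function name allocate_bits, the candidate widths (4 and 8 bits), and the per-layer error values are placeholders invented for illustration, not measurements from the paper.

```python
def allocate_bits(err_by_bits, budget_avg_bits, low=4, high=8):
    """Greedy per-layer activation bit allocation under an average-bit budget.

    err_by_bits: one dict per layer mapping bit width -> estimated error.
    Starts every layer at `low` bits, then upgrades the layers that gain
    the most from `high` bits for as long as the average stays in budget.
    """
    n = len(err_by_bits)
    bits = [low] * n
    # Rank layers by how much error an upgrade from low to high removes.
    ranked = sorted(
        range(n),
        key=lambda i: err_by_bits[i][low] - err_by_bits[i][high],
        reverse=True,
    )
    for i in ranked:
        if (sum(bits) + (high - low)) / n <= budget_avg_bits:
            bits[i] = high
    return bits

# Placeholder sensitivities for four layers (illustrative values only).
toy_errors = [
    {4: 0.9, 8: 0.1},
    {4: 0.2, 8: 0.05},
    {4: 0.5, 8: 0.1},
    {4: 0.1, 8: 0.05},
]
plan = allocate_bits(toy_errors, budget_avg_bits=5.0)
print(plan)
```

Because each layer's error is estimated in isolation, this baseline cannot see that quantization noise introduced in one layer changes the input distribution of the next; modeling that coupling is what the "inter-layer aware" part of ILA-AMP refers to.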

In practice

Achieve W1.58A4 compression for LLMs.
Accelerate LLM inference significantly.
Use the released code (Kishon-zzx/TWLA) to apply TWLA.
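
The "1.58-bit" figure and the scale of the weight-memory saving follow from simple arithmetic, sketched below. The 7-billion-parameter count is a hypothetical example, not a model from the paper, and the estimate covers weights only: it ignores quantization scales, any layers kept at higher precision, activations, and the KV cache.

```python
import math

def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in GiB for a given average bit width."""
    return n_params * bits_per_weight / 8 / 2**30

# A ternary weight has 3 possible values, so it carries log2(3) bits.
ternary_bits = math.log2(3)

# Practical packing: 5 ternary values fit in one byte because 3**5 = 243 <= 256.
packed_bits = 8 / 5

n_params = 7e9  # hypothetical model size, for illustration only
fp16 = weight_memory_gib(n_params, 16)
ideal = weight_memory_gib(n_params, ternary_bits)
packed = weight_memory_gib(n_params, packed_bits)

print(f"bits per ternary weight: {ternary_bits:.3f}")
print(f"FP16 weights:            {fp16:.2f} GiB")
print(f"ternary (ideal):         {ideal:.2f} GiB")
print(f"ternary (5 per byte):    {packed:.2f} GiB")
print(f"FP16 / ideal ratio:      {fp16 / ideal:.1f}x")
```

Relative to 16-bit weights, ideal ternary storage is about 10x smaller; the byte-aligned packing shown here costs 1.6 bits per weight, slightly above the information-theoretic 1.58.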

Topics

Large Language Models
Post-Training Quantization
Ternary Weights
Low-Bit Activations
Model Compression
Inference Acceleration

Code references

Kishon-zzx/TWLA

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.