DynamicPTQ: Mitigating Activation Quantization Collapse via Residual-Stream Dynamics

2026-06-10 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

DynamicPTQ is a post-training quantization (PTQ) policy designed to mitigate activation quantization collapse in large language models, particularly when weights, activations, and KV caches are all quantized to 4 bits (W4A4KV4). It targets "massive activations", whose extreme values amplify quantization error and which static, transformation-based smoothing methods do not fully handle. DynamicPTQ analyzes residual-stream dynamics to find the quantization-sensitive layers where massive activations emerge and disappear in a phase-wise pattern across depth. It then assigns 8-bit activation precision only to these layers and keeps 4-bit precision everywhere else. Experiments on LLaMA-2 and LLaMA-3 show consistent improvements in perplexity and zero-shot QA performance under W4A4KV4 quantization, along with 1.05x to 1.07x throughput gains at modest memory overhead.

Key takeaway

For machine learning engineers optimizing large language model inference, DynamicPTQ offers a practical route to more robust W4A4KV4 quantization. It assigns 8-bit activation precision only to the layers that residual-stream dynamics flag as sensitive, which improves perplexity and zero-shot QA performance on models like LLaMA-2 and LLaMA-3 and yields 1.05x to 1.07x throughput gains with modest memory overhead. Consider evaluating this policy on top of your existing PTQ baselines for more efficient deployment.

Key insights

DynamicPTQ uses residual-stream dynamics to identify quantization-sensitive layers and applies 8-bit activation quantization only to them, improving low-bit LLM inference.

Principles

Massive activations emerge phase-wise across network depth.
Cross-layer residual changes cause dynamic quantization instability.
Static smoothing methods cannot fully resolve dynamic instability.
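
To make the first principle concrete, here is a minimal sketch of how a single massive activation degrades low-bit quantization. It assumes symmetric per-tensor fake quantization and synthetic Gaussian data with one injected outlier; the function names, the outlier value of 50.0, and the tensor size are illustrative choices, not details from the original article.

```python
import random

def fake_quantize(values, bits):
    """Symmetric per-tensor fake quantization: snap to a signed integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
normal = [rng.gauss(0.0, 1.0) for _ in range(1024)]
# Illustrative assumption: replace one entry with a single "massive activation".
with_outlier = normal[:-1] + [50.0]

err_a4_normal = mse(normal, fake_quantize(normal, 4))
err_a4_outlier = mse(with_outlier, fake_quantize(with_outlier, 4))
err_a8_outlier = mse(with_outlier, fake_quantize(with_outlier, 8))

print(f"A4, no outlier:   {err_a4_normal:.4f}")
print(f"A4, with outlier: {err_a4_outlier:.4f}")
print(f"A8, with outlier: {err_a8_outlier:.4f}")
```

The outlier stretches the quantization scale so far that, at 4 bits, most ordinary values round to zero. Moving the same tensor to 8 bits recovers much of the lost resolution, which is the intuition behind raising precision only where massive activations appear.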

Method

DynamicPTQ identifies quantization-sensitive layers from residual-stream dynamics using Jump Ratio and Historical Feature SNR, then assigns 8-bit activation precision only to these layers.
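
The selection step can be sketched as follows. This summary does not give the definitions of Jump Ratio or Historical Feature SNR, so the code uses a hypothetical stand-in: the ratio of peak residual magnitude between consecutive layers, with a layer flagged when that ratio jumps up or down by at least a chosen threshold. The threshold, the function names, and the per-layer peak values are all assumptions for illustration, and the SNR criterion is omitted entirely.

```python
def select_sensitive_layers(residual_peaks, threshold=8.0):
    """Flag layers whose peak residual magnitude changes by >= threshold x
    relative to the previous layer, in either direction.

    Hypothetical proxy only; not the article's Jump Ratio definition.
    """
    sensitive = []
    for i in range(1, len(residual_peaks)):
        prev, cur = residual_peaks[i - 1], residual_peaks[i]
        ratio = cur / prev if prev > 0 else float("inf")
        if ratio >= threshold or ratio <= 1.0 / threshold:
            sensitive.append(i)
    return sensitive

def assign_activation_bits(num_layers, sensitive, low=4, high=8):
    """Give `high` activation bits to sensitive layers and `low` bits to the rest."""
    flagged = set(sensitive)
    return [high if i in flagged else low for i in range(num_layers)]

# Made-up per-layer peak |residual| values: a massive activation
# emerges at layer 2 and disappears at layer 5.
peaks = [1.2, 1.5, 40.0, 42.0, 41.0, 3.0, 2.8, 2.9]
sensitive = select_sensitive_layers(peaks)
bits = assign_activation_bits(len(peaks), sensitive)

print(sensitive)  # [2, 5]
print(bits)       # [4, 4, 8, 4, 4, 8, 4, 4]
```

In this toy trace only the layers where the magnitude emerges and disappears are flagged, which mirrors the phase-wise pattern described above; the layers in between stay at 4 bits because their magnitude is stable from one layer to the next.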

In practice

Integrate DynamicPTQ with QuaRot, SpinQuant, or FlatQuant.
Apply 8-bit activation precision only to sensitive layers.
Achieve W4A4KV4 quantization with improved throughput.
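
As a rough way to reason about the overhead of this mixed-precision setup, the sketch below computes the average activation bit-width of a per-layer plan. The layer count and the indices of the 8-bit layers are hypothetical, not figures from the article.

```python
def average_activation_bits(bits_per_layer):
    """Mean activation bit-width across layers for a mixed-precision plan."""
    return sum(bits_per_layer) / len(bits_per_layer)

# Hypothetical plan: 32 layers, three of them promoted to 8-bit activations.
bits_per_layer = [4] * 32
for idx in (1, 2, 30):  # illustrative indices only
    bits_per_layer[idx] = 8

avg_bits = average_activation_bits(bits_per_layer)
print(avg_bits)  # 4.375
```

When only a few layers are promoted, the average stays close to 4 bits, which fits the claim that the memory overhead is modest.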

Topics

Post-training Quantization
LLM Inference Optimization
Activation Quantization
Residual Stream Dynamics
Mixed-Precision Quantization
LLaMA Models

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.