HARP: Hadamard-Preconditioned Adaptive Rotation Processor for Extreme LLM Quantization

2026-05-28 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

HARP (Hadamard-preconditioned Adaptive Rotation Processor) is a learnable, structured, two-sided orthogonal processor for extreme low-bit post-training quantization (PTQ) of Large Language Models. Published on 2026-05-28, it targets a limitation of fixed randomized Hadamard transforms (RHTs), which struggle with activation outliers and anisotropic weight curvature at 2-4 bits. Instead of applying one fixed transform, HARP adapts the quantization basis to each layer and backend by representing rotations as sparse, butterfly-like block-orthogonal stages, a structure that also supports non-power-of-two dimensions. It initializes to the RHT processor and maintains exact full-precision equivalence. HARP improves perplexity and zero-shot accuracy over fixed RHT across models ranging from 1B to 70B parameters. It also preserves deployment efficiency, achieving 128 tokens/second compared to 61 tokens/second for FP16.

Key takeaway

For MLOps engineers deploying Large Language Models at extreme low bit widths, HARP targets the accuracy degradation that fixed Hadamard rotations leave unaddressed. Evaluate it for 2-4 bit models: it improves perplexity and zero-shot accuracy over fixed Hadamard transforms while keeping deployment efficient, at 128 tokens/second versus 61 tokens/second for FP16. This allows more aggressive quantization with less accuracy loss.

Key insights

HARP learns a per-layer, two-sided rotation of the quantization basis for extreme low-bit quantization, improving accuracy over fixed Hadamard transforms while keeping inference throughput above FP16 (128 vs. 61 tokens/second).

Principles

Adaptive rotation improves low-bit quantization robustness.
Structured orthogonal processors maintain full-precision equivalence.
Learnable bases enhance layer-specific quantization.
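
The first two principles can be illustrated with a toy sketch. It assumes a fixed 4x4 normalized Hadamard matrix as a stand-in for the learned rotations, a simple symmetric round-to-nearest quantizer, and a hand-picked activation row with one outlier. None of the names or numbers below come from the HARP paper; they only show why an orthogonal rotation leaves the full-precision output unchanged and how it can spread an outlier across coordinates.

```python
# Toy illustration only: a fixed normalized Hadamard matrix stands in for
# the learned rotations. All names here are hypothetical, not HARP's API.

H4 = [[0.5 * v for v in row] for row in (
    (1, 1, 1, 1),
    (1, -1, 1, -1),
    (1, 1, -1, -1),
    (1, -1, -1, 1),
)]

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def quantize(row, bits):
    """Symmetric round-to-nearest quantizer with one scale per vector."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in row) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in row]

def sq_error(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

U = H4  # input-side rotation (assumed)
V = H4  # output-side rotation (assumed)

x = [[8.0, 1.0, -1.0, 1.0]]  # one activation row with an outlier (hand-picked)
W = [[1.0, 0.0, 2.0, -1.0],
     [0.0, 1.0, -1.0, 0.5],
     [2.0, -1.0, 0.0, 1.0],
     [-1.0, 0.5, 1.0, 0.0]]

# Two-sided rotation: activations become x U, weights become U^T W V.
x_rot = matmul(x, U)
W_rot = matmul(matmul(transpose(U), W), V)

# Undoing the output rotation reproduces x W when nothing is quantized.
y_ref = matmul(x, W)
y_rot = matmul(matmul(x_rot, W_rot), transpose(V))

# 3-bit quantization error of the activation row, before and after rotation.
err_plain = sq_error(x[0], quantize(x[0], 3))
err_rot = sq_error(x_rot[0], quantize(x_rot[0], 3))
```

In this toy case `y_rot` matches `y_ref` up to floating-point rounding, the largest activation magnitude drops from 8 to 4.5 after rotation, and the squared 3-bit quantization error falls from 3.0 to 0.25. Because the rotation is orthogonal, the error measured in the rotated basis has the same norm once rotated back.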

Method

HARP replaces fixed Hadamard mixing with learnable, sparse butterfly-like block-orthogonal stages. It adapts the quantization basis to each layer and backend using calibration data, supporting mixed-radix schedules for non-power-of-two dimensions.
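
One way to picture "learnable butterfly stages that initialize to the RHT" is the minimal sketch below. It assumes radix-2 stages on a power-of-two dimension, with each pair of coordinates mixed by a 2x2 orthogonal reflection controlled by a single angle; setting every angle to pi/4 reproduces the normalized fast Walsh-Hadamard transform, and a random sign flip in front makes it an RHT. The parameterization, the function names (`butterfly_stage`, `init_params`, `rotate`), and the random perturbation standing in for calibration-based learning are all assumptions for illustration. The summary does not specify HARP's actual block structure or its mixed-radix schedule for non-power-of-two dimensions.

```python
import math
import random

def butterfly_stage(x, stride, thetas):
    """Mix each pair (j, j + stride) with the 2x2 orthogonal block
    [[cos t, sin t], [sin t, -cos t]]; one angle per pair."""
    y = list(x)
    k = 0
    for start in range(0, len(x), 2 * stride):
        for j in range(start, start + stride):
            c, s = math.cos(thetas[k]), math.sin(thetas[k])
            a, b = x[j], x[j + stride]
            y[j] = c * a + s * b
            y[j + stride] = s * a - c * b
            k += 1
    return y

def init_params(n, seed=0):
    """RHT-style start: random signs plus pi/4 angles in every stage.
    Assumes n is a power of two (radix-2 only in this sketch)."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    num_stages = n.bit_length() - 1  # log2(n)
    stage_thetas = [[math.pi / 4] * (n // 2) for _ in range(num_stages)]
    return signs, stage_thetas

def rotate(x, signs, stage_thetas):
    """Apply the sign flip, then butterfly stages with strides 1, 2, 4, ..."""
    y = [s * v for s, v in zip(signs, x)]
    stride = 1
    for thetas in stage_thetas:
        y = butterfly_stage(y, stride, thetas)
        stride *= 2
    return y

def norm(v):
    return math.sqrt(sum(u * u for u in v))

signs, stage_thetas = init_params(8)
x = [8.0, 1.0, -1.0, 1.0, 0.5, -0.5, 0.25, 0.0]
y_init = rotate(x, signs, stage_thetas)

# Stand-in for learning: nudge every angle away from pi/4.
rng = random.Random(1)
learned = [[t + rng.uniform(-0.1, 0.1) for t in thetas] for thetas in stage_thetas]
y_learned = rotate(x, signs, learned)

# With all signs +1 and pi/4 angles, a unit basis vector maps to a constant
# vector, which is the first row of the normalized Hadamard matrix.
e0 = [1.0] + [0.0] * 7
h_row = rotate(e0, [1.0] * 8, stage_thetas)
```

Every stage is orthogonal for any choice of angles, so `norm(y_init)` and `norm(y_learned)` both equal `norm(x)` up to rounding. That is the property that lets the rotation be trained freely while the unquantized network output stays the same. Each stage costs n/2 small 2x2 multiplies, so the full rotation stays at O(n log n) work like the fixed Hadamard transform it starts from.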

In practice

Deploy 2-4 bit LLMs with improved accuracy.
Run quantized models at a reported 128 tok/s, versus 61 tok/s for FP16.
Adapt quantization for diverse LLM architectures.

Topics

Large Language Models
Post-training Quantization
Low-bit Quantization
Hadamard Transforms
Model Deployment
Inference Optimization

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.