HARP: Hadamard-Preconditioned Adaptive Rotation Processor for Extreme LLM Quantization
Summary
HARP (Hadamard-preconditioned Adaptive Rotation Processor) is a learnable, structured, two-sided orthogonal processor for extreme low-bit post-training quantization (PTQ) of Large Language Models. Published on 2026-05-28, it targets a limitation of fixed randomized Hadamard transforms (RHTs), which struggle with activation outliers and anisotropic weight curvature at 2-4 bits. Instead of applying one fixed transform, HARP adapts the quantization basis to each layer and backend. It represents rotations as sparse butterfly-like block-orthogonal stages, which also support non-power-of-two dimensions. The processor is initialized to the RHT and maintains exact full-precision equivalence, so the rotated model computes the same outputs as the original before quantization is applied. Across models from 1B to 70B parameters, HARP is reported to improve perplexity and zero-shot accuracy over fixed RHT. It also preserves deployment efficiency, with a reported 128 tokens/second compared to 61 tokens/second for FP16.
Key takeaway
For MLOps engineers deploying Large Language Models at extreme low bit widths, HARP targets the accuracy degradation that fixed Hadamard transforms leave at 2-4 bits. It is worth evaluating for 2-4 bit models: it is reported to improve perplexity and zero-shot accuracy over fixed Hadamard transforms while keeping deployment throughput high, at 128 tokens/second compared to 61 tokens/second for FP16. This allows more aggressive quantization with less loss of model quality.
Key insights
HARP learns a per-layer rotation basis for extreme low-bit quantization, improving accuracy over fixed Hadamard transforms while preserving inference throughput.
Principles
- Adaptive rotation improves low-bit quantization robustness.
- Structured orthogonal processors maintain full-precision equivalence.
- Learnable bases enhance layer-specific quantization.
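The full-precision equivalence principle can be illustrated with a small worked example. This is a minimal sketch, not HARP's implementation: the names `U`, `V`, and `rotation`, the 2x2 sizes, and the angles are all assumptions chosen for illustration. It shows only the general property that a two-sided orthogonal change of basis leaves a linear layer's full-precision output unchanged once the output-side rotation is undone.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def rotation(theta):
    """2x2 rotation matrix; orthogonal for any angle (illustrative stand-in)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

# Hypothetical layer weights and one input column vector.
W = [[0.9, -2.0], [0.1, 4.0]]
x = [[3.0], [-1.0]]

U = rotation(0.7)    # assumed output-side orthogonal matrix
V = rotation(-1.2)   # assumed input-side orthogonal matrix

W_rot = matmul(matmul(U, W), transpose(V))  # weights in the rotated basis: U W V^T
x_rot = matmul(V, x)                        # input in the rotated basis: V x
y_rot = matmul(W_rot, x_rot)                # equals U W x, because V^T V = I
y_back = matmul(transpose(U), y_rot)        # undo the output rotation: U^T (U W x)
y_ref = matmul(W, x)                        # original full-precision output
```

Here `y_back` matches `y_ref` up to floating-point rounding, while `W_rot` is a different matrix from `W`. In a quantization pipeline it is the rotated weights and activations that get quantized, which is why the choice of basis can change the quantization error without changing the full-precision function.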
Method
HARP replaces fixed Hadamard mixing with learnable, sparse butterfly-like block-orthogonal stages. It adapts the quantization basis to each layer and backend using calibration data, supporting mixed-radix schedules for non-power-of-two dimensions.
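The idea of butterfly stages that start out as a randomized Hadamard transform can be sketched as follows. This is a hedged illustration under stated assumptions, not the published method: the function names (`butterfly_stage`, `apply_processor`, `init_as_rht`, `hadamard_reference`), the one-angle-per-pair parameterization, and the restriction to power-of-two sizes are all choices made for this example. Mixed-radix schedules and the calibration-driven training of the angles are not shown.

```python
import math
import random

def butterfly_stage(x, stride, thetas):
    """Apply 2x2 orthogonal blocks to index pairs (i, i + stride).

    Each block is [[cos t, sin t], [sin t, -cos t]], which is orthogonal
    for any angle t and equals the normalized Hadamard block at t = pi/4.
    """
    out = list(x)
    k = 0
    for start in range(0, len(x), 2 * stride):
        for i in range(start, start + stride):
            j = i + stride
            c, s = math.cos(thetas[k]), math.sin(thetas[k])
            out[i] = c * x[i] + s * x[j]
            out[j] = s * x[i] - c * x[j]
            k += 1
    return out

def apply_processor(x, signs, stage_thetas):
    """Random sign flip followed by log2(n) butterfly stages."""
    v = [s * xi for s, xi in zip(signs, x)]
    stride = 1
    for thetas in stage_thetas:
        v = butterfly_stage(v, stride, thetas)
        stride *= 2
    return v

def init_as_rht(n, rng):
    """Parameters that reproduce a randomized Hadamard transform exactly."""
    if n < 2 or n & (n - 1):
        raise ValueError("this sketch handles power-of-two sizes only")
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    stage_thetas = [[math.pi / 4] * (n // 2) for _ in range(n.bit_length() - 1)]
    return signs, stage_thetas

def hadamard_reference(x):
    """Dense normalized Walsh-Hadamard transform, used only as a check."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum((-1) ** bin(i & j).count("1") * x[j] for j in range(n))
            for i in range(n)]

rng = random.Random(0)
signs, stage_thetas = init_as_rht(8, rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(8)]
rotated = apply_processor(x, signs, stage_thetas)
```

At initialization, `rotated` agrees with `hadamard_reference` applied to the sign-flipped input. Changing any angle keeps every stage orthogonal, so the vector norm is preserved while the basis moves away from the fixed Hadamard one. Each stage touches every element once, so the cost is O(n log n) rather than the O(n^2) of a dense rotation.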
In practice
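- Deploy 2-4 bit LLMs with better perplexity and zero-shot accuracy than fixed RHT.
- Serve quantized models at a reported 128 tokens/second, compared to 61 tokens/second for FP16.
- Adapt the quantization basis per layer and backend, including layers with non-power-of-two dimensions.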
Topics
- Large Language Models
- Post-training Quantization
- Low-bit Quantization
- Hadamard Transforms
- Model Deployment
- Inference Optimization
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.