A Hardware-Aware, Per-Layer Methodology for Post-Training Quantization of Large Language Models

2026-05-14 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI Hardware · Depth: Expert, quick

Summary

Scaled Outer Product (SOP) is a post-training quantization method for large language model weights that targets near-lossless fidelity at 4.5-6 bits per weight (bpw) on hardware supporting per-layer LUT decode. The method searches, per layer, over pairs of fixed and dynamic codebooks, with a per-block selection bit choosing within each pair. It combines this with signed per-block scales, activation-weighted cosine selection, and multiple-choice knapsack promotion for sensitive layers, plus outlier and sparse-residual correction. Fixed codebooks include NF4, BOF4, Split87, and SH4, while per-layer optimized codebooks (DD4) reside in LUT SRAM. A new hardware-efficient LUT output format (HIF) is introduced to improve performance and energy efficiency and to reduce cost. Across six open model families, the recommended FP6 operating point (E2M3sUE4M4, 6.5 bpw) achieves lower weight reconstruction error than the FP8 baseline (E4M3, 8.0 bpw) while storing 1.5 fewer bits per weight.
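The storage figures above can be sanity-checked with simple arithmetic. The sketch below assumes that E2M3 denotes a 6-bit element (1 sign, 2 exponent, 3 mantissa bits), that UE4M4 denotes an 8-bit unsigned scale, and that one scale is shared by a block of 16 weights. The block size is not stated in this summary; it is an assumption chosen because it reproduces the quoted 6.5 bpw. The helper name `bits_per_weight` is hypothetical, and selection bits or residual-correction overhead are not counted.

```python
def bits_per_weight(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Average storage per weight when one scale is shared by a block of weights."""
    return element_bits + scale_bits / block_size

BLOCK_SIZE = 16  # assumption: not given in the summary

fp6_bpw = bits_per_weight(6, 8, BLOCK_SIZE)   # 6-bit element + 8-bit block scale
fp4_bpw = bits_per_weight(4, 8, BLOCK_SIZE)   # 4-bit codebook index + 8-bit block scale
fp8_bpw = 8.0                                 # plain FP8 (E4M3), no block scale

print(f"FP6 block-scaled: {fp6_bpw} bpw")
print(f"4-bit block-scaled: {fp4_bpw} bpw")
print(f"Saving vs FP8: {fp8_bpw - fp6_bpw} bpw")
```

Under these assumptions the 4-bit case lands at the lower end of the 4.5-6 bpw range quoted above, and the FP6 case matches the 6.5 bpw operating point.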

Key takeaway

For AI engineers optimizing large language model deployment, SOP offers an alternative to conventional FP8 weight quantization. The FP6 operating point (E2M3sUE4M4, 6.5 bpw) is reported to achieve lower weight reconstruction error than FP8 (E4M3, 8.0 bpw) while storing 1.5 fewer bits per weight, provided the target hardware supports per-layer LUT decode. Evaluate the HIF format for further performance and energy gains on your specific hardware configuration.

Key insights

SOP quantization achieves near-lossless LLM fidelity at 4.5-6 bpw using hardware-aware, per-layer codebook optimization.

Principles

Block-scaled small atoms can replace FP8.
Per-layer LUT decode is key for fidelity.
Hardware-aware design improves efficiency.

Method

SOP combines per-layer search of fixed/dynamic codebooks, per-block selection, signed scales, activation-weighted cosine selection, and knapsack promotion for sensitive layers, with outlier/sparse-residual correction.
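As a rough illustration of the per-block part of this method, the sketch below encodes one block of weights by trying both codebooks of a pair and both scale signs, then keeping the candidate with the highest activation-weighted cosine similarity to the original block. It is one plausible reading of the summary, not the paper's implementation: the two codebooks are placeholders (not NF4, BOF4, Split87, SH4, or DD4), the scale is a plain signed absmax, and all function names are hypothetical.

```python
import numpy as np

# Placeholder 16-entry codebooks, deliberately asymmetric so the scale sign matters.
CODEBOOK_A = np.arange(-8, 8) / 8.0                      # uniform grid: -1.0 .. 0.875
CODEBOOK_B = np.sign(CODEBOOK_A) * CODEBOOK_A ** 2       # denser near zero

def quantize_block(block, codebook, scale):
    """Nearest-entry indices and the dequantized block for a given scale."""
    idx = np.abs(block[:, None] / scale - codebook[None, :]).argmin(axis=1)
    return idx, codebook[idx] * scale

def weighted_cosine(a, b, w):
    """Cosine similarity under per-element weights w (e.g. activation energy)."""
    den = np.sqrt(np.sum(w * a * a) * np.sum(w * b * b))
    return float(np.sum(w * a * b) / den) if den > 0 else 0.0

def encode_block(block, act_weight):
    """Return (selection_bit, signed_scale, indices, reconstruction)."""
    absmax = np.max(np.abs(block))
    if absmax == 0:
        zeros = np.zeros(block.shape, dtype=np.int64)
        return 0, 0.0, zeros, np.zeros_like(block)
    best = None
    for sel, codebook in enumerate((CODEBOOK_A, CODEBOOK_B)):
        for sign in (1.0, -1.0):                         # signed per-block scale
            scale = sign * absmax
            idx, recon = quantize_block(block, codebook, scale)
            score = weighted_cosine(block, recon, act_weight)
            if best is None or score > best[0]:
                best = (score, sel, scale, idx, recon)
    _, sel, scale, idx, recon = best
    return sel, scale, idx, recon

rng = np.random.default_rng(0)
block = rng.standard_normal(16)
sel, scale, idx, recon = encode_block(block, np.ones(16))
print(sel, scale, weighted_cosine(block, recon, np.ones(16)))
```

A negative scale lets a block whose largest element is positive use the codebook's exact -1.0 endpoint, which is why the sign is searched alongside the codebook choice in this sketch.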

In practice

Use FP6 (E2M3sUE4M4, 6.5 bpw) in place of FP8 for lower storage.
Consider HIF for improved hardware performance.
Apply layer promotion for sensitive layers.
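Layer promotion can be framed as a multiple-choice knapsack: each layer offers several precision options, exactly one is picked per layer, and total error is minimized under a storage budget. The sketch below is a small dynamic program over integer costs. The costs and error values are made-up illustrative numbers, and `promote_layers` is a hypothetical helper, not the paper's procedure.

```python
def promote_layers(options, budget):
    """Pick one (cost, error) option per layer, minimizing total error.

    options[l] is a list of (integer_cost, error) choices for layer l.
    Returns (total_error, chosen_indices), or None if no selection fits the budget.
    """
    # best[c] = (error, picks) for the layers seen so far at exact total cost c
    best = {0: (0.0, [])}
    for layer_opts in options:
        nxt = {}
        for cost_so_far, (err, picks) in best.items():
            for j, (cost, layer_err) in enumerate(layer_opts):
                total = cost_so_far + cost
                if total > budget:
                    continue
                cand = (err + layer_err, picks + [j])
                if total not in nxt or cand[0] < nxt[total][0]:
                    nxt[total] = cand
        best = nxt
    if not best:
        return None
    return min(best.values(), key=lambda t: t[0])

# Three layers, each with a cheap option (cost 4) and a promoted option (cost 6).
layers = [
    [(4, 1.0), (6, 0.2)],   # moderately sensitive
    [(4, 0.5), (6, 0.4)],   # barely benefits from promotion
    [(4, 3.0), (6, 0.5)],   # highly sensitive
]
print(promote_layers(layers, budget=16))  # budget allows two promotions
print(promote_layers(layers, budget=14))  # budget allows one promotion
```

With room for only one promotion, the program promotes the layer whose error drops the most; with room for two, it skips the layer that barely benefits.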

Topics

Scaled Outer Product
Post-Training Quantization
Large Language Models
Hardware-Aware Quantization
LUT Decode

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.