Further Accelerating Kimi-K2.5 on AMD Instinct™ MI325X: W4A8 & W8A8 Quantization with AMD Quark

2026-05-14 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

This article details two quantization strategies, W4A8 and W8A8, for accelerating Kimi-K2.5 inference on AMD Instinct™ MI325X GPUs, building on a previous W4A16 optimization. The W4A8 strategy (INT4 weights, INT8 activations) re-quantizes Kimi-K2.5's Mixture-of-Experts (MoE) layers to a per-channel layout with AMD Quark's `ProgressiveSpec` two-stage quantization and extends the FlyDSL fused MoE kernel to use 8-bit MFMA instructions. The reported gains are up to 16% lower time per output token (TPOT) and 18% higher throughput at low concurrency, and 10% lower TPOT at high concurrency, with no accuracy loss on GSM8K. The W8A8 strategy (FP8 weights, FP8 activations) uses a simpler single-stage Quark quantization and AITER's native CK/ASM FP8 kernels. W8A8 achieved the best per-token latency at low concurrency (14.25 ms TPOT, 16% faster than W4A8) and the highest accuracy (94.24% on GSM8K), at the cost of twice the memory per GPU (~124 GB vs ~62 GB for W4A8).

Key takeaway

AI engineers optimizing Kimi-K2.5 inference on AMD Instinct™ MI325X should consider W4A8 quantization for high-concurrency, bandwidth-bound scenarios, where it delivers about 10% lower TPOT at roughly half the memory footprint of W8A8 (~62 GB/GPU). If your workload is low-concurrency and compute-bound, or if maximum accuracy is paramount and you can accommodate ~124 GB/GPU, W8A8 offers the lowest per-token latency and highest accuracy. Evaluate your batch size distribution and HBM capacity to select between them.
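
The HBM side of that trade-off is simple arithmetic. The sketch below uses the approximate per-GPU footprints quoted in the summary and the MI325X's 256 GB of HBM3E; `reserve_gb` is a hypothetical allowance for runtime overhead, not a figure from the article.

```python
# Headroom arithmetic only. Footprints are the approximate per-GPU figures
# quoted in the summary; 256 GB is the MI325X HBM3E capacity.
HBM_PER_GPU_GB = 256
FOOTPRINT_GB = {"W4A8": 62, "W8A8": 124}


def headroom_gb(strategy: str, reserve_gb: float = 0.0) -> float:
    """HBM left for KV cache and activations once the model is loaded.

    reserve_gb is a hypothetical allowance for runtime overhead.
    """
    return HBM_PER_GPU_GB - FOOTPRINT_GB[strategy] - reserve_gb


for name in FOOTPRINT_GB:
    print(f"{name}: ~{headroom_gb(name):.0f} GB free per GPU")
```

Under these assumptions W4A8 leaves about 62 GB more per GPU for KV cache, which matters most when many sequences are in flight.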

Key insights

Quantization to W4A8 or W8A8 accelerates Kimi-K2.5 inference on AMD MI325X GPUs: W4A8 lowers TPOT by up to 16%, and W8A8 is a further 16% faster per token at low concurrency, with no reported accuracy loss on GSM8K.

Principles

INT8/FP8 MFMA offers 2x the peak throughput of BF16 on MI325X.
Dynamic per-token activation quantization keeps activation quantization error low.
Per-channel weight scales are required for efficient INT8 MFMA.
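
One way to read the last two principles together: when each token row and each weight output channel carries a single scale, the whole reduction over K can stay in integer arithmetic and be rescaled once at the end. The following is a minimal pure-Python sketch of that idea, not the FlyDSL or AITER kernel; the function names and sample values are illustrative.

```python
def quantize_rows(matrix, qmax=127):
    """Symmetric quantization with one scale per row.

    Applied to activations (one row per token, scale computed at run time)
    and to weights stored as one row per output channel.
    """
    codes, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / qmax if amax > 0 else 1.0
        codes.append([max(-qmax, min(qmax, round(v / scale))) for v in row])
        scales.append(scale)
    return codes, scales


def int8_matmul(x, w):
    """Approximate x @ w.T with integer accumulation and one rescale per output."""
    xq, xs = quantize_rows(x)  # per-token scales
    wq, ws = quantize_rows(w)  # per-channel scales
    out = []
    for t, x_row in enumerate(xq):
        out_row = []
        for c, w_row in enumerate(wq):
            acc = sum(a * b for a, b in zip(x_row, w_row))  # integer dot product over K
            out_row.append(acc * xs[t] * ws[c])  # single rescale after the reduction
        out.append(out_row)
    return out


x = [[0.5, -1.0, 2.0, 0.25], [8.0, 0.1, -0.2, 4.0]]  # 2 tokens, K = 4
w = [[0.1, 0.2, -0.3, 0.4], [1.0, -2.0, 0.5, 0.0], [0.05, 0.0, 0.02, -0.01]]  # 3 channels
approx = int8_matmul(x, w)
exact = [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]
max_err = max(abs(a - e) for ar, er in zip(approx, exact) for a, e in zip(ar, er))
print(f"max abs error: {max_err:.4f}")
```

With per-group scales along K, the accumulator would instead have to be rescaled at every group boundary, which interrupts the integer reduction.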

Method

AMD Quark's `ProgressiveSpec` performs two-stage weight quantization (FP8 outlier clipping then INT4 per-channel) for W4A8, while `FP8E4M3PerChannelSpec` is used for W8A8.
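
To make the two stages concrete, here is a numerical sketch under one assumed interpretation: stage 1 rounds each weight to FP8 E4M3 under a per-channel scale (saturating outliers), and stage 2 maps the stage-1 values to INT4 codes with a second per-channel scale. This does not call Quark and is not its implementation; `round_to_e4m3`, `progressive_quantize_row`, and `clip_ratio` are hypothetical names introduced for illustration.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    _, exp = math.frexp(x)  # x = m * 2**exp with 0.5 <= |m| < 1
    step = 2.0 ** (max(exp, -5) - 4)  # 3 mantissa bits; fixed step for subnormals
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, round(x / step) * step))


def progressive_quantize_row(row, clip_ratio=1.0):
    """Two-stage quantization of one output channel (one weight row).

    clip_ratio is a hypothetical knob: values below 1.0 saturate the
    largest magnitudes in stage 1.
    """
    amax = max(abs(v) for v in row) or 1.0
    fp8_scale = amax * clip_ratio / FP8_E4M3_MAX
    stage1 = [round_to_e4m3(v / fp8_scale) * fp8_scale for v in row]  # stage 1: FP8

    int4_scale = (max(abs(v) for v in stage1) or 1.0) / 7
    codes = [max(-8, min(7, round(v / int4_scale))) for v in stage1]  # stage 2: INT4
    return codes, int4_scale


row = [0.02, -0.15, 0.4, -0.07, 1.2, 0.33, -0.85, 0.05]
codes, scale = progressive_quantize_row(row)
recon = [c * scale for c in codes]
print(codes, f"max abs error: {max(abs(r - v) for r, v in zip(recon, row)):.3f}")
```

Each channel ends up with INT4 codes plus one scale, which is the per-channel layout the summary describes for W4A8.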

In practice

Use `amd/Kimi-K2.5-W4A8` for high-concurrency, memory-bound workloads.
Use `ginsongsong/Kimi-K2.5-W8A8` for low-concurrency, compute-bound tasks.
Exclude attention and non-expert layers from quantization to preserve accuracy.
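
The exclusion rule can be pictured as a name filter over the model's modules. The sketch below uses glob patterns from Python's standard library; the module names and patterns are hypothetical, modeled on a DeepSeek-style MoE layout, and the real checkpoint's names may differ.

```python
from fnmatch import fnmatch

# Hypothetical module names for illustration only.
layer_names = [
    "model.layers.3.self_attn.q_a_proj",
    "model.layers.3.self_attn.o_proj",
    "model.layers.3.mlp.gate",
    "model.layers.3.mlp.experts.0.gate_proj",
    "model.layers.3.mlp.experts.0.down_proj",
    "lm_head",
]

# Assumed glob patterns for layers kept in their original precision.
EXCLUDE = ["*self_attn*", "*mlp.gate", "lm_head"]


def should_quantize(name: str) -> bool:
    """Return True only for layers that match no exclusion pattern."""
    return not any(fnmatch(name, pattern) for pattern in EXCLUDE)


to_quantize = [n for n in layer_names if should_quantize(n)]
print(to_quantize)
```

Only the routed expert projections pass the filter here; the attention projections, the router gate, and the output head stay unquantized.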

Topics

Kimi-K2.5
AMD Instinct MI325X
W4A8 Quantization
W8A8 Quantization
LLM Inference Optimization

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.