Further Accelerating Kimi-K2.5 on AMD Instinct™ MI325X: W4A8 & W8A8 Quantization with AMD Quark
Summary
This article details two quantization strategies, W4A8 and W8A8, for accelerating Kimi-K2.5 inference on AMD Instinct™ MI325X GPUs, building on a previous W4A16 optimization. The W4A8 strategy (INT4 weights, INT8 activations) re-quantizes Kimi-K2.5's Mixture-of-Experts (MoE) layers to a per-channel layout with AMD Quark's two-stage `ProgressiveSpec` quantization, and extends the FlyDSL fused MoE kernel to use 8-bit MFMA instructions. It yielded up to 16% lower TPOT (time per output token) and 18% higher throughput at low concurrency, and 10% lower TPOT at high concurrency, with no accuracy loss on GSM8K. The W8A8 strategy (FP8 weights, FP8 activations) uses a simpler single-stage Quark quantization and AITER's native CK/ASM FP8 kernels. W8A8 achieved the best per-token latency at low concurrency (14.25 ms TPOT, 16% faster than W4A8) and the highest accuracy (94.24% on GSM8K), at the cost of twice the memory per GPU (~124 GB vs ~62 GB for W4A8).
Key takeaway
For AI engineers optimizing Kimi-K2.5 inference on AMD Instinct™ MI325X, consider W4A8 quantization for high-concurrency, bandwidth-bound scenarios: it has the smaller memory footprint (~62 GB/GPU) and delivered 10% lower TPOT at high concurrency, along with up to 18% higher throughput at low concurrency. If your workload is low-concurrency and compute-bound, or if maximum accuracy is paramount and you can accommodate ~124 GB/GPU, W8A8 offers the lowest per-token latency and the highest accuracy. Evaluate your batch size distribution and HBM capacity before choosing a strategy.
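The 2x memory gap follows from simple arithmetic on bits per weight. The sketch below is a back-of-the-envelope estimate only: the parameter count (about 1 trillion) and the 8-GPU sharding are assumptions that this summary does not state, and the calculation ignores scales, unquantized layers, and KV cache.

```python
def weight_gb_per_gpu(num_params, bits_per_weight, num_gpus):
    """Approximate weight storage per GPU in GB (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / num_gpus / 1e9

# Both values below are assumptions for illustration, not figures from the article.
ASSUMED_PARAMS = 1.0e12   # assumed total parameter count
ASSUMED_GPUS = 8          # assumed number of GPUs the weights are sharded across

w4_gb = weight_gb_per_gpu(ASSUMED_PARAMS, 4, ASSUMED_GPUS)  # 4-bit weights
w8_gb = weight_gb_per_gpu(ASSUMED_PARAMS, 8, ASSUMED_GPUS)  # 8-bit weights
print(f"W4A8: ~{w4_gb:.1f} GB/GPU, W8A8: ~{w8_gb:.1f} GB/GPU")
```

Under these assumptions the estimate lands close to the ~62 GB and ~124 GB per GPU quoted above, which suggests the footprint is dominated by weight storage.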
Key insights
Quantizing Kimi-K2.5 to W4A8 or W8A8 speeds up inference on AMD MI325X GPUs: W4A8 lowered TPOT by up to 16% with no accuracy loss on GSM8K, and W8A8 gave the lowest low-concurrency latency (14.25 ms TPOT).
Principles
- INT8/FP8 MFMA offers 2x throughput over BF16 on MI325X.
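- Dynamic per-token activation quantization computes a fresh scale for each token at run time, which reduces quantization error compared with a single static scale.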
- Per-channel weight scales are required for efficient INT8 MFMA.
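These principles can be illustrated with a minimal NumPy sketch of a W8A8-style INT8 linear layer. It is not the FlyDSL or AITER kernel; the function names are hypothetical and the symmetric max-based scaling is an assumption chosen for simplicity.

```python
import numpy as np

def quantize_activations_per_token(x):
    """Dynamic symmetric INT8: one scale per token (row), computed at run time."""
    max_abs = np.max(np.abs(x), axis=1, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / 127.0
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_weights_per_channel(w):
    """Static symmetric INT8: one scale per output channel (row of w)."""
    max_abs = np.max(np.abs(w), axis=1, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / 127.0
    q = np.clip(np.rint(w / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_linear(x, w):
    x_q, s_x = quantize_activations_per_token(x)   # shapes (T, K) and (T, 1)
    w_q, s_w = quantize_weights_per_channel(w)     # shapes (N, K) and (N, 1)
    # Pure integer accumulation over K, standing in for the 8-bit MFMA path.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T
    # Scales are applied once per output element, after the reduction.
    return acc.astype(np.float32) * s_x * s_w.T

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((8, 64)).astype(np.float32)
y_ref = x @ w.T
y_q = int8_linear(x, w)
rel_err = np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)
print(f"relative error: {rel_err:.4f}")
```

Because each weight scale is constant along the reduction dimension, it factors out of the integer dot product and can be applied after accumulation. With group-wise scales along that dimension, the accumulation would have to be interrupted to rescale partial sums, which is why a per-channel layout suits the integer matrix path.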
Method
AMD Quark's `ProgressiveSpec` performs two-stage weight quantization (FP8 outlier clipping then INT4 per-channel) for W4A8, while `FP8E4M3PerChannelSpec` is used for W8A8.
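The two-stage idea can be sketched in NumPy as follows. This is not Quark's implementation or API: stage 1 is approximated by a simple per-channel clamp (FP8 rounding is not modeled), and `clip_ratio` is a hypothetical parameter introduced only for illustration.

```python
import numpy as np

INT4_MIN, INT4_MAX = -8, 7

def clip_outliers(w, clip_ratio=0.9):
    """Stage 1 stand-in: clamp each output channel to clip_ratio * its max |w|."""
    limit = clip_ratio * np.max(np.abs(w), axis=1, keepdims=True)
    return np.clip(w, -limit, limit), limit

def quantize_int4_per_channel(w, limit):
    """Stage 2: symmetric INT4 with one scale per output channel."""
    scale = np.maximum(limit, 1e-8) / INT4_MAX
    # INT4 values are stored in an int8 container; real kernels pack two per byte.
    q = np.clip(np.rint(w / scale), INT4_MIN, INT4_MAX).astype(np.int8)
    return q, scale

def two_stage_quantize(w, clip_ratio=0.9):
    clipped, limit = clip_outliers(w, clip_ratio)
    return quantize_int4_per_channel(clipped, limit)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 128)).astype(np.float32)
q, scale = two_stage_quantize(w)
w_hat = q.astype(np.float32) * scale   # dequantized approximation of w
print(q.min(), q.max(), scale.shape)
```

Clamping outliers first narrows the range that the 16 INT4 levels must cover, so the bulk of the weights get finer resolution at the cost of saturating the largest values.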
In practice
- Use `amd/Kimi-K2.5-W4A8` for high-concurrency, memory-bound workloads.
- Use `ginsongsong/Kimi-K2.5-W8A8` for low-concurrency, compute-bound tasks.
- Exclude attention and non-expert layers from quantization for accuracy.
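The exclusion rule in the last item can be pictured as a name filter over the model's layers. The sketch below uses only the Python standard library; the layer names and glob patterns are hypothetical and do not reflect Quark's configuration format or Kimi-K2.5's actual module names.

```python
from fnmatch import fnmatch

# Hypothetical patterns for layers to keep in higher precision.
EXCLUDE_PATTERNS = ["*self_attn*", "*lm_head*", "*.gate"]

def should_quantize(layer_name, exclude=EXCLUDE_PATTERNS):
    """Return True if the layer matches none of the exclusion patterns."""
    return not any(fnmatch(layer_name, pattern) for pattern in exclude)

# Hypothetical layer names for illustration.
layers = [
    "model.layers.3.self_attn.q_proj",
    "model.layers.3.mlp.experts.17.up_proj",
    "model.layers.3.mlp.experts.17.down_proj",
    "model.layers.3.mlp.gate",
    "lm_head",
]

to_quantize = [name for name in layers if should_quantize(name)]
print(to_quantize)
```

In this toy list only the two expert projections are selected, which mirrors the intent of quantizing the MoE experts while leaving attention and other non-expert layers untouched.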
Topics
- Kimi-K2.5
- AMD Instinct MI325X
- W4A8 Quantization
- W8A8 Quantization
- LLM Inference Optimization
Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.