Faster Kimi-K2.5-W4A8 Decoding with EAGLE3 on AMD Instinct™ MI325X

Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, long

Summary

AMD accelerated Kimi-K2.5-W4A8 decoding on Instinct™ MI325X GPUs by combining EAGLE3 speculative decoding with targeted kernel optimizations. On an 8× MI325X setup at concurrency=40, the combined stack cut median decode-step latency (TPOT) from 42.73 ms to 27.41 ms, a 35.9% reduction, and raised output throughput from 672 to 895 tokens/second, a 33.1% increase.

Most of the gain, roughly a 35% TPOT reduction and a 30% throughput increase, comes from EAGLE3 itself. It uses a 3B-parameter BF16 draft model to propose a 4-token chain and reaches an accept length of about 3.93 out of 4.0. Three kernel patches, including Stage2 MoE "tile_k=256" and a Stage1 MoE scheduler-hint gate, add a further 1-2% on TPOT and 2-3% on throughput. The optimized stack shows no measurable GSM8K regression.
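The headline percentages can be re-derived from the quoted before/after figures. The snippet below is a small arithmetic check, not part of the original benchmark; the helper name `percent_change` is made up for illustration. Because the summary quotes throughput rounded to whole tokens/second, the recomputed throughput gain lands near 33.2% rather than exactly 33.1%.

```python
def percent_change(before: float, after: float) -> float:
    """Signed change from `before` to `after`, as a percentage of `before`."""
    return (after - before) / before * 100.0


# Figures quoted in the summary (8x MI325X, concurrency=40).
tpot_before_ms, tpot_after_ms = 42.73, 27.41
tput_before, tput_after = 672.0, 895.0  # tokens/second, as rounded in the summary

# TPOT goes down, so negate the signed change to report a reduction.
tpot_reduction = -percent_change(tpot_before_ms, tpot_after_ms)
throughput_gain = percent_change(tput_before, tput_after)

print(f"TPOT reduction:  {tpot_reduction:.1f}%")
print(f"Throughput gain: {throughput_gain:.1f}%")
```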

Key takeaway

For AI engineers serving Kimi-K2.5-W4A8 on AMD Instinct™ MI325X GPUs, EAGLE3 speculative decoding is the main lever. In the reported 8-GPU, concurrency=40 setup, the combined stack achieved up to 35.9% lower decode latency and 33.1% higher throughput with no measurable accuracy loss on GSM8K. Use the "lightseekorg/kimi-k2.5-eagle3" draft model with "--speculative-num-draft-tokens 4", then apply the recommended kernel tuning, such as "AITER_FLYDSL_STAGE2_TILE_K=256", to pick up the remaining few percent.
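As a rough sketch of how these settings might be assembled, the snippet below builds a launch command without running it. It assumes the serving framework is SGLang (the summary does not name one), whose launcher accepts "--speculative-algorithm", "--speculative-draft-model-path", and "--tp"; the target model path is a placeholder, and the helper `build_launch` is hypothetical. Depending on the framework version, further speculative-decoding flags may be needed.

```python
import shlex
from typing import Dict, List, Tuple


def build_launch(model_path: str, tp_size: int = 8) -> Tuple[Dict[str, str], List[str]]:
    """Return (environment overrides, argv) for a hypothetical launch.

    Assumption: SGLang is the serving framework. Only the draft model name,
    the draft-token count, and the environment variable come from the summary.
    """
    env = {"AITER_FLYDSL_STAGE2_TILE_K": "256"}  # Stage2 MoE tile_k tuning
    argv = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,               # placeholder, not from the source
        "--tp", str(tp_size),                     # 8x MI325X in the reported setup
        "--speculative-algorithm", "EAGLE3",
        "--speculative-draft-model-path", "lightseekorg/kimi-k2.5-eagle3",
        "--speculative-num-draft-tokens", "4",
    ]
    return env, argv


env, argv = build_launch("<path-to-Kimi-K2.5-W4A8>")
env_prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in env.items())
print(env_prefix, shlex.join(argv))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; on a real node the same `env` and `argv` could be passed to `subprocess.run`.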

Key insights

EAGLE3 speculative decoding raises LLM decode throughput on AMD MI325X because the target model verifies several drafted tokens in a single forward pass instead of generating them one at a time.

Principles

Method

Integrate EAGLE3 speculative decoding with Kimi-K2.5-W4A8, using a 3B-parameter draft model and "--speculative-num-draft-tokens 4". Apply the kernel patches tuned for the M=4 verify shape, including Stage2 MoE "tile_k=256" and the Stage1 MoE scheduler-hint gate.
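The verify step behind this method can be illustrated with a toy example. The sketch below shows generic greedy chain verification, not AMD's or EAGLE3's actual implementation: the function `verify_chain` and its inputs are hypothetical, and frameworks differ in how they count draft tokens and accept length, so the counting here need not match the 3.93 / 4.0 figure's convention.

```python
from typing import List


def verify_chain(draft: List[int], target_preds: List[int]) -> List[int]:
    """Greedy verification of one drafted token chain.

    draft:        k tokens proposed by the draft model.
    target_preds: k + 1 tokens from a single target-model pass, where
                  target_preds[i] is the target's greedy next token given
                  the context plus draft[:i].
    Returns the tokens emitted for this step: the longest matching draft
    prefix followed by one token chosen by the target model.
    """
    if len(target_preds) != len(draft) + 1:
        raise ValueError("target_preds must have exactly len(draft) + 1 entries")

    emitted: List[int] = []
    for i, token in enumerate(draft):
        if token != target_preds[i]:
            break  # first disagreement: discard the rest of the chain
        emitted.append(token)

    # The target's own token at the stop position is always valid: it is a
    # correction after a mismatch, or a bonus token if every draft matched.
    emitted.append(target_preds[len(emitted)])
    return emitted


# Two draft tokens match, the third does not.
print(verify_chain([5, 6, 7, 8], [5, 6, 9, 1, 2]))
```

Each verify pass emits between 1 and k + 1 tokens for the cost of one target-model forward pass, so the closer the average emitted length is to its maximum, the more decode steps are saved.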

Best for: Machine Learning Engineer, AI Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.