Faster Kimi-K2.5-W4A8 Decoding with EAGLE3 on AMD Instinct™ MI325X
Summary
AMD accelerated Kimi-K2.5-W4A8 decoding on Instinct™ MI325X GPUs by integrating EAGLE3 speculative decoding and applying targeted kernel optimizations. On an 8× MI325X setup at concurrency=40, the combined approach reduced median decode-step latency (TPOT) from 42.73 ms to 27.41 ms, a 35.9% improvement, and raised output throughput from 672 to 895 tokens/second, a 33.1% increase. Most of the gain, approximately a 35% TPOT reduction and a 30% throughput increase, comes from EAGLE3, which uses a 3B-parameter BF16 draft model to propose a 4-token chain and reaches an accept length of about 3.93 out of a maximum of 4.0. Three kernel patches, including a Stage2 MoE "tile_k=256" setting and a Stage1 MoE scheduler-hint gate, contributed an additional 1-2% TPOT reduction and 2-3% throughput gain. The optimization stack showed no measurable GSM8K regression.
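As a quick check on the headline numbers, the percentages can be recomputed from the before/after figures quoted in the summary. This is a minimal sketch; the helper names are illustrative and not from the original article. From the rounded 672 and 895 tokens/second values the throughput gain comes out near 33.2%, so the quoted 33.1% was presumably computed from unrounded measurements.

```python
def pct_reduction(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


def pct_increase(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures quoted in the summary (8x MI325X, concurrency=40).
tpot_gain = pct_reduction(42.73, 27.41)   # median TPOT in ms
throughput_gain = pct_increase(672, 895)  # output tokens/second

print(f"TPOT reduction:      {tpot_gain:.1f}%")
print(f"Throughput increase: {throughput_gain:.1f}%")
```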
Key takeaway
For AI engineers optimizing Kimi-K2.5-W4A8 inference on AMD Instinct™ MI325X GPUs, EAGLE3 speculative decoding is the main lever. In the reported 8× MI325X, concurrency=40 configuration it delivered 35.9% lower decode latency and 33.1% higher throughput with no measurable accuracy loss. Use the "lightseekorg/kimi-k2.5-eagle3" draft model with "--speculative-num-draft-tokens 4" and apply the recommended kernel tuning, such as "AITER_FLYDSL_STAGE2_TILE_K=256", to improve serving efficiency and reduce operational costs.
Key insights
EAGLE3 speculative decoding raises LLM inference throughput on AMD MI325X by letting the target model verify several drafted tokens in a single forward pass instead of generating them one at a time.
Principles
- Speculative decoding amortizes target model cost.
- Tree-based drafts improve accept length.
- Shape-aware kernel tuning refines performance.
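The amortization principle can be illustrated with a toy greedy draft-and-verify loop. This is a sketch under stated assumptions, not EAGLE3 itself and not the article's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the target and draft models, the draft is a simple chain of `k` tokens, and verification keeps the longest prefix that matches the target's greedy choice plus one token from the target. The inner loop emulates what a real system computes in one batched target forward pass.

```python
from typing import List, Tuple


def target_next(ctx: List[int]) -> int:
    """Hypothetical target model: deterministic next token for a context."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except at every 5th position."""
    t = target_next(ctx)
    return t if len(ctx) % 5 else (t + 1) % 50


def greedy_decode(prompt: List[int], n: int) -> List[int]:
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt: List[int], n: int, k: int) -> Tuple[List[int], int]:
    """Generate n tokens, counting how many target verify passes were needed."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n:
        # 1) The draft model proposes a chain of k tokens.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2) One target pass scores all k+1 positions (emulated sequentially here).
        target_passes += 1
        accepted: List[int] = []
        for i in range(k + 1):
            expected = target_next(out + accepted)
            accepted.append(expected)  # always emit the target's own choice
            if i == k or draft[i] != expected:
                break  # bonus token reached, or first mismatch corrected
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n], target_passes


prompt, n = [1, 2, 3], 20
spec_tokens, passes = speculative_decode(prompt, n, k=3)
print("same output as greedy:", spec_tokens == greedy_decode(prompt, n))
print("target passes:", passes, "for", n, "tokens")
```

Because every emitted token is the target's own greedy choice, the output matches plain greedy decoding; the saving comes only from needing fewer target passes per generated token.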
Method
Integrate EAGLE3 speculative decoding with Kimi-K2.5-W4A8, using a 3B-parameter draft model and "--speculative-num-draft-tokens 4". Apply the kernel patches tuned for the M=4 verify shape, including "tile_k=256" and the scheduler-hint gate.
In practice
- Use "lightseekorg/kimi-k2.5-eagle3" draft model.
- Set "--speculative-num-draft-tokens 4" for EAGLE3.
- Apply "AITER_FLYDSL_STAGE2_TILE_K=256" for MoE.
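The settings listed above could be assembled into a launch command roughly as follows. This is a sketch resting on several assumptions: the summary does not name the serving framework, so SGLang is assumed here because "--speculative-num-draft-tokens" matches its CLI; the target checkpoint path is a placeholder; and `build_launch` is a hypothetical helper. Only the draft model name, the draft-token count, and the environment variable come from the source. Other speculative settings (for example step count and top-k) are not given in this summary and may also be required.

```python
import os
import shlex
from typing import Dict, List, Tuple


def build_launch(target_model_path: str, tp_size: int = 8) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for a server launch; nothing is executed here."""
    # Environment variable named in the summary for the Stage2 MoE tile size.
    env = dict(os.environ, AITER_FLYDSL_STAGE2_TILE_K="256")
    cmd = [
        "python", "-m", "sglang.launch_server",   # assumption: SGLang is the server
        "--model-path", target_model_path,         # placeholder path, not from the source
        "--tp", str(tp_size),                      # assumption: one rank per GPU on 8x MI325X
        "--speculative-algorithm", "EAGLE3",
        "--speculative-draft-model-path", "lightseekorg/kimi-k2.5-eagle3",
        "--speculative-num-draft-tokens", "4",
    ]
    return cmd, env


cmd, env = build_launch("/path/to/Kimi-K2.5-W4A8")
print("AITER_FLYDSL_STAGE2_TILE_K=" + env["AITER_FLYDSL_STAGE2_TILE_K"], shlex.join(cmd))
```

Building the argument list separately from running it makes it easy to inspect or log the exact configuration before passing `cmd` and `env` to something like `subprocess.run`.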
Topics
- Kimi-K2.5
- Speculative Decoding
- EAGLE3
- AMD Instinct MI325X
- LLM Inference
- Kernel Optimization
Best for: Machine Learning Engineer, AI Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.