Adaptive Top-K Selection: Eliminating Performance Cliffs Across All K Values on AMD GPUs
Summary
AMD's AITER library includes an adaptive Top-K selection strategy, tuned for MI300X GPUs, that removes performance cliffs across K values in LLM and RAG workloads. It switches between a bitonic-based algorithm for small K and a radix-based algorithm for larger K. For small K, the bitonic path uses AMD-specific optimizations: DPP instructions for low-latency data exchange between lanes, med3 for branch-free comparisons, buffer instructions for vectorized memory access, and double buffering to hide memory latency. For large K, the radix path is preferred because its histogram-processing cost is fixed: that overhead penalizes small K but is amortized as K grows. An empirically derived formula, tuned for MI300X, sets the switching threshold from both K and the sequence length n, so performance stays consistent across workload configurations.
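The fixed histogram cost of the radix path can be seen in a small CPU-side sketch. This is an illustration only, not AITER's GPU kernel: the function name `radix_topk`, the restriction to non-negative 32-bit integer keys, and the 8-bit digit width are all assumptions made for the example.

```python
def radix_topk(values, k, key_bits=32, digit_bits=8):
    """Return the k largest values, descending (hypothetical helper).

    Assumes non-negative integers below 2**key_bits and that key_bits
    is a multiple of digit_bits.
    """
    if not 0 < k <= len(values):
        raise ValueError("k must be in 1..len(values)")
    radix = 1 << digit_bits
    mask = radix - 1
    candidates = list(values)
    selected = []      # values already proven to be in the top k
    remaining = k      # how many more values are still needed
    for shift in range(key_bits - digit_bits, -1, -digit_bits):
        # Histogram pass: the work depends on the candidate count and the
        # radix, not on k. There are always key_bits / digit_bits passes.
        hist = [0] * radix
        for v in candidates:
            hist[(v >> shift) & mask] += 1
        # Walk buckets from the largest digit down to the bucket that
        # contains the k-th largest value.
        digit = radix - 1
        while hist[digit] < remaining:
            remaining -= hist[digit]
            digit -= 1
        # Everything in a higher bucket is in the top k; only the boundary
        # bucket needs to be examined in the next pass.
        selected.extend(v for v in candidates if (v >> shift) & mask > digit)
        candidates = [v for v in candidates if (v >> shift) & mask == digit]
    # Surviving candidates all equal the k-th largest value.
    selected.extend(candidates[:remaining])
    return sorted(selected, reverse=True)
```

The number of passes is the same whether k is 1 or close to n, which is why this cost is hard to justify for very small K and easy to justify for large K.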
Key takeaway
For AI engineers optimizing LLM and RAG inference on AMD MI300X GPUs, the AITER library's AdaptiveTopK implementation selects the most efficient Top-K algorithm automatically, removing bottlenecks at both small and large K. Integrating it avoids manual algorithm tuning, which matters most when sequence lengths and K values vary.
Key insights
Adaptive Top-K selection on AMD GPUs dynamically optimizes performance by switching between bitonic and radix sort based on K and sequence length.
Principles
- Fixed overheads disproportionately impact small-K performance.
- Overlap memory loads with computation to hide latency.
- Hardware-specific instructions reduce instruction count and latency.
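To make the last principle concrete, here is a minimal Python sketch of a bitonic network whose compare-exchange step is written as a median-of-three with an infinite sentinel, so the same operation yields either the maximum or the minimum without branching on the data. The partner lookup `i ^ stride` stands in for a cross-lane exchange. The function names are hypothetical, and whether AITER's kernels use med3 in exactly this way is an assumption; the sketch only shows that the identity works.

```python
import math


def med3(a, b, c):
    """Median of three values using only min/max (emulates a med3 op)."""
    return max(min(a, b), min(max(a, b), c))


def bitonic_sort_desc(values):
    """Sort a power-of-two-length list in descending order."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    lanes = list(values)
    size = 2
    while size <= n:
        stride = size // 2
        while stride >= 1:
            # Each lane reads the value held by lane (i XOR stride),
            # analogous to a cross-lane data exchange.
            partner = [lanes[i ^ stride] for i in range(n)]
            updated = []
            for i in range(n):
                descending_block = (i & size) == 0
                lower_lane = (i & stride) == 0
                keep_max = descending_block == lower_lane
                # med3(a, b, +inf) == max(a, b); med3(a, b, -inf) == min(a, b)
                sentinel = math.inf if keep_max else -math.inf
                updated.append(med3(lanes[i], partner[i], sentinel))
            lanes = updated
            stride //= 2
        size *= 2
    return lanes


def bitonic_topk(values, k):
    """Top-k via a full bitonic sort; fine for short inputs."""
    return bitonic_sort_desc(values)[:k]
```

The comparison schedule is fixed by the input length alone, so every lane executes the same instruction sequence regardless of the data, which is what makes this pattern attractive for small K on wide SIMD hardware.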
Method
An adaptive strategy selects between bitonic sort (for small K, leveraging DPP, med3, buffer instructions, double buffering) and radix sort (for large K) based on an empirically tuned formula considering K and sequence length.
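The dispatch logic described here can be sketched as follows. The summary does not state the tuned formula, so `placeholder_threshold` and its cutoff values are invented purely for illustration and are not AMD's MI300X thresholds; the two branches use standard-library stand-ins rather than real bitonic or radix kernels, and all function names are hypothetical.

```python
import heapq


def placeholder_threshold(n):
    """Hypothetical K cutoff for sequence length n.

    NOT the tuned formula: real cutoffs must be measured on the target GPU.
    """
    return 64 if n < 16384 else 128


def choose_algorithm(k, n, threshold_for=placeholder_threshold):
    """Pick "bitonic" for small k and "radix" otherwise, given k and n."""
    if not 0 < k <= n:
        raise ValueError("k must be in 1..n")
    return "bitonic" if k <= threshold_for(n) else "radix"


def adaptive_topk(values, k, threshold_for=placeholder_threshold):
    """Return (algorithm_name, top-k values in descending order)."""
    algorithm = choose_algorithm(k, len(values), threshold_for)
    if algorithm == "bitonic":
        # Stand-in for the small-K path.
        result = heapq.nlargest(k, values)
    else:
        # Stand-in for the large-K path.
        result = sorted(values, reverse=True)[:k]
    return algorithm, result
```

Passing the threshold as a function keeps the decision rule separate from the kernels, so a cutoff measured on one device can be swapped for another without touching the dispatch code.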
In practice
- Utilize AMD's AITER library for optimized Top-K selection.
- Implement double buffering to hide global memory latency.
- Employ DPP instructions for intra-wavefront data exchange and med3 for branch-free comparisons.
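The double-buffering item above can be illustrated on the CPU as a rough analogy. In this sketch a worker thread plays the role of the in-flight memory load: the next chunk is requested before the current one is processed, so loading and computing overlap. `load_chunk` and `topk_double_buffered` are hypothetical names, and the list slice is a stand-in for a global-memory read; a GPU kernel would instead alternate between two register or LDS buffers.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def load_chunk(data, start, size):
    """Stand-in for a global-memory load of one chunk."""
    return list(data[start:start + size])


def topk_double_buffered(data, k, chunk_size=4):
    """Running top-k over chunks, prefetching the next chunk while the
    current one is being processed."""
    if k <= 0 or chunk_size <= 0:
        raise ValueError("k and chunk_size must be positive")
    best = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None  # load that was issued on the previous iteration
        for start in range(0, len(data), chunk_size):
            # Issue the next load first ...
            upcoming = pool.submit(load_chunk, data, start, chunk_size)
            # ... then process the previously loaded chunk while it runs.
            if pending is not None:
                best = heapq.nlargest(k, best + pending.result())
            pending = upcoming
        if pending is not None:
            # Drain the final buffer.
            best = heapq.nlargest(k, best + pending.result())
    return best
```

Only two buffers are ever live at once (the one being processed and the one being filled), which is the property that gives the technique its name.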
Topics
- Top-K Selection
- AMD MI300X GPUs
- Adaptive Algorithms
- Bitonic Sort
- Performance Optimization
Best for: Machine Learning Engineer, AI Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.