RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts
Summary
RaMP (Runtime-Aware Megakernel Polymorphism) is a dispatch framework for Mixture-of-Experts (MoE) inference. Existing production systems choose kernels from batch size alone and ignore the runtime expert routing distribution, which leaves 10-70% of achievable kernel throughput unrealized. RaMP introduces a performance-region analysis, derived from hardware constants, that predicts the optimal kernel configuration across 8 tested architectures, including 3 unseen. It also uses a four-parameter wave cost model that selects the fastest configuration from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search after 10-24 minutes of one-time profiling per model. Paired with a co-designed CuTe DSL kernel that exposes 134-268 polymorphic configurations, RaMP delivers a 1.22x kernel speedup over static dispatch, and end-to-end speedups in vLLM serving of 1.30x over Triton, 1.41x over DeepGEMM, and 1.13x over FlashInfer CUTLASS.
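The summary does not define regret. Assuming the usual relative-regret definition, the 0.93% figure can be read as follows, where the symbols h (expert histogram), c (kernel configuration), t_c(h) (measured kernel time), and ĉ(h) (the configuration the cost model picks) are notation introduced here for illustration:

```latex
\[
\mathrm{regret}(h) \;=\; \frac{t_{\hat{c}(h)}(h) \;-\; \min_{c}\, t_{c}(h)}{\min_{c}\, t_{c}(h)},
\qquad
\text{mean regret} \;=\; \frac{1}{|H|} \sum_{h \in H} \mathrm{regret}(h)
\]
```

Under this reading, a mean regret of 0.93% would mean the selected configuration is on average less than 1% slower than the best configuration found by exhaustive search over the evaluated histograms H.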
Key takeaway
For MLOps engineers deploying MoE models, routing-aware dispatch such as RaMP is a direct lever on inference throughput. Static dispatch keyed on batch size alone can leave 10-70% of kernel throughput unrealized. A dynamic, physically grounded cost model that accounts for the actual expert routing distribution recovers much of that gap, as shown by RaMP's 1.30x end-to-end speedup over Triton in vLLM serving.
Key insights
Optimizing MoE inference requires dynamic kernel configuration based on runtime expert routing, not just batch size.
Principles
- Optimal kernel configuration is routing-dependent.
- Physically grounded cost models generalize across architectures.
- Profiling cost scales linearly with optimization dimensions.
Method
RaMP uses a performance-region analysis and a four-parameter wave cost model to dynamically select optimal kernel configurations based on runtime expert histograms, achieving sub-50µs dispatch.
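The selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not RaMP's actual model: the summary does not name the four parameters, so `launch_us`, `wave_us`, `tail_us`, and `expert_us` are hypothetical stand-ins, as are the names `KernelConfig`, `predict_cost_us`, and `select_config` and all numeric values.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelConfig:
    """One kernel configuration plus four fitted cost parameters (all hypothetical)."""
    name: str
    tile_m: int        # tokens covered by one output tile
    tile_n: int        # output columns covered by one output tile
    launch_us: float   # fixed overhead per kernel launch
    wave_us: float     # cost of one full wave of tiles across all SMs
    tail_us: float     # cost of a final, partially filled wave
    expert_us: float   # per-active-expert overhead


def predict_cost_us(cfg, histogram, n_out, num_sms):
    """Predict kernel time from the per-expert token histogram."""
    active = [n for n in histogram if n > 0]
    tiles_n = math.ceil(n_out / cfg.tile_n)
    # Each expert pads its token count up to a whole number of tiles.
    tiles = sum(math.ceil(n / cfg.tile_m) for n in active) * tiles_n
    full_waves, tail_tiles = divmod(tiles, num_sms)
    tail_cost = cfg.tail_us if tail_tiles else 0.0
    return (cfg.launch_us + cfg.wave_us * full_waves
            + tail_cost + cfg.expert_us * len(active))


def select_config(configs, histogram, n_out, num_sms):
    """Return the configuration with the lowest predicted cost."""
    return min(configs, key=lambda c: predict_cost_us(c, histogram, n_out, num_sms))


# Illustrative values only; real parameters would come from profiling.
CONFIGS = [
    KernelConfig("tile64", 64, 128, launch_us=5.0, wave_us=10.0, tail_us=8.0, expert_us=0.1),
    KernelConfig("tile128", 128, 128, launch_us=5.0, wave_us=18.0, tail_us=14.0, expert_us=0.1),
]

balanced = [128] * 8                    # 1024 tokens spread evenly over 8 experts
skewed = [960, 64, 0, 0, 0, 0, 0, 0]    # same 1024 tokens, concentrated on one expert

best_balanced = select_config(CONFIGS, balanced, n_out=2048, num_sms=132)
best_skewed = select_config(CONFIGS, skewed, n_out=2048, num_sms=132)
```

With these made-up parameters, the two histograms carry the same number of tokens yet select different configurations, which is the effect a batch-size-only dispatcher cannot capture.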
In practice
- Profile MoE kernels at 25 (batch size, balancedness) points.
- Implement a CuTe DSL kernel with diverse configurations.
- Cache dispatch results per step for amortization.
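The first and last steps above might look like the sketch below. Several things here are assumptions rather than details from the source: the 25 points are taken to be a 5 x 5 grid, the grid values are placeholders, balancedness is approximated by the normalized entropy of the routing histogram, and `StepDispatchCache` is one possible reading of per-step caching.

```python
import itertools
import math

# Hypothetical grid values; the source only states that there are 25 points.
BATCH_SIZES = (16, 64, 256, 1024, 4096)
BALANCEDNESS_LEVELS = (0.2, 0.4, 0.6, 0.8, 1.0)


def profiling_grid():
    """All (batch size, balancedness) points to profile: 5 x 5 = 25."""
    return list(itertools.product(BATCH_SIZES, BALANCEDNESS_LEVELS))


def balancedness(histogram):
    """Normalized Shannon entropy: 1.0 for uniform routing, 0.0 for a single expert.

    This metric is an assumption; the source does not define balancedness.
    """
    total = sum(histogram)
    if total == 0 or len(histogram) < 2:
        return 0.0
    probs = [n / total for n in histogram if n > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(histogram))


class StepDispatchCache:
    """Reuse dispatch decisions within one serving step, then discard them."""

    def __init__(self, select_fn):
        self._select_fn = select_fn   # maps a histogram to a kernel configuration
        self._step = None
        self._entries = {}
        self.misses = 0

    def get(self, step, layer, histogram):
        if step != self._step:        # new step: routing changed, drop old entries
            self._step = step
            self._entries.clear()
        key = (layer, tuple(histogram))
        if key not in self._entries:
            self.misses += 1
            self._entries[key] = self._select_fn(histogram)
        return self._entries[key]
```

Keying on the step means the selection cost is paid once per layer and histogram within a step, and stale decisions never outlive the routing they were computed for.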
Topics
- Mixture-of-Experts Inference
- Routing-Aware Dispatch
- GPU Kernel Optimization
- Wave Cost Model
- Performance Region Analysis
Best for: MLOps Engineer, AI Scientist, Research Scientist, Machine Learning Engineer, AI Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.