RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

RaMP (Runtime-Aware Megakernel Polymorphism) is a dispatch framework for Mixture-of-Experts (MoE) inference. Existing production systems dispatch kernels based solely on batch size and ignore the runtime expert routing distribution, which leaves 10-70% of kernel throughput unrealized. RaMP introduces a performance-region analysis, derived from hardware constants, that predicts optimal kernel configurations across 8 tested architectures, including 3 unseen. It also employs a four-parameter wave cost model that selects the fastest configuration from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search after 10-24 minutes of one-time profiling per model. Paired with a co-designed CuTe DSL kernel offering 134-268 polymorphic configurations, RaMP delivers a 1.22x kernel speedup over static dispatch and end-to-end speedups in vLLM serving of 1.30x over Triton, 1.41x over DeepGEMM, and 1.13x over FlashInfer CUTLASS.
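
The summary reports "mean regret versus exhaustive search" without defining it. A minimal sketch, assuming the common definition (relative slowdown of the selected configuration against the best configuration found by exhaustive search, averaged over workloads); the function name and the sample timings are hypothetical and are not taken from the paper:

```python
def mean_regret(selected_us, best_us):
    """Mean relative regret of selected configs versus the exhaustive-search best.

    ASSUMPTION: regret = (t_selected - t_best) / t_best per workload.
    The source does not state the exact definition used by RaMP.
    """
    if len(selected_us) != len(best_us) or not selected_us:
        raise ValueError("need two non-empty timing lists of equal length")
    per_workload = [(s - b) / b for s, b in zip(selected_us, best_us)]
    return sum(per_workload) / len(per_workload)


# Hypothetical kernel timings in microseconds for two workloads.
selected = [100.0, 202.0]  # time of the config the cost model picked
best = [100.0, 200.0]      # time of the fastest config from exhaustive search
regret = mean_regret(selected, best)
print(f"mean regret: {regret:.2%}")
```

Under this definition, a mean regret of 0.93% would mean the selected configurations run on average under 1% slower than the best available ones.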

Key takeaway

For MLOps engineers deploying MoE models, routing-aware dispatch such as RaMP's is a direct way to raise inference throughput. Static dispatch based on batch size alone can leave 10-70% of kernel throughput unrealized. A dynamic, physically grounded cost model that considers the actual expert routing distribution can yield substantial speedups, as demonstrated by RaMP's 1.30x end-to-end gain over Triton in vLLM serving.

Key insights

Optimizing MoE inference requires dynamic kernel configuration based on runtime expert routing, not just batch size.

Method

RaMP uses a performance-region analysis and a four-parameter wave cost model to dynamically select optimal kernel configurations based on runtime expert histograms, achieving sub-50µs dispatch.
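
To make the idea of histogram-driven selection concrete, here is a minimal sketch of a wave-style cost model. It is an illustration only, not RaMP's actual model: the four parameters (`launch_us`, `wave_fixed_us`, `row_us`, `tile_us`), the two tile configurations, the SM count, and all numeric values are assumptions chosen for readability. The premise is that each expert's tokens are split into tiles of `tile_m` rows, tiles execute in waves of `num_sms` at a time, and predicted time grows with the number of waves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelConfig:
    name: str
    tile_m: int  # token rows handled by one tile (hypothetical knob)


@dataclass(frozen=True)
class WaveCostParams:
    # ASSUMPTION: four illustrative parameters, not the ones fitted by RaMP.
    launch_us: float      # fixed kernel launch overhead
    wave_fixed_us: float  # fixed cost per wave
    row_us: float         # per-wave cost per tile row
    tile_us: float        # scheduling overhead per tile


def ceil_div(a, b):
    return -(-a // b)


def predicted_cost_us(histogram, cfg, p, num_sms):
    """Predict kernel time from the per-expert token histogram."""
    # Each expert pads its token count up to a whole number of tiles.
    tiles = sum(ceil_div(tokens, cfg.tile_m) for tokens in histogram)
    # Tiles run num_sms at a time; a partially filled wave still costs a wave.
    waves = ceil_div(tiles, num_sms)
    per_wave = p.wave_fixed_us + p.row_us * cfg.tile_m
    return p.launch_us + waves * per_wave + p.tile_us * tiles


def select_config(histogram, configs, p, num_sms):
    """Return the configuration with the lowest predicted cost."""
    return min(configs, key=lambda c: predicted_cost_us(histogram, c, p, num_sms))


CONFIGS = [KernelConfig("tile_m=16", 16), KernelConfig("tile_m=64", 64)]
PARAMS = WaveCostParams(launch_us=5.0, wave_fixed_us=2.0, row_us=0.25, tile_us=0.1)
NUM_SMS = 8  # deliberately tiny so the arithmetic is easy to follow

# Two routing outcomes with the same batch size (512 tokens, 8 experts).
uniform = [64] * 8
skewed = [456] + [8] * 7

for label, hist in (("uniform", uniform), ("skewed", skewed)):
    print(label, "->", select_config(hist, CONFIGS, PARAMS, NUM_SMS).name)
```

With these assumed numbers, the uniform histogram favors the larger tile and the skewed histogram favors the smaller one, even though both have the same batch size. A dispatcher keyed on batch size alone cannot distinguish the two cases.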

Best for: MLOps Engineer, AI Scientist, Research Scientist, Machine Learning Engineer, AI Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.