Rewriting model inference with CUDA kernels: the bottleneck was not just GEMM [P]
Summary
FlashRT is a new CUDA-first inference runtime under development that targets latency bottlenecks in small-batch, real-time machine learning workloads, particularly robotics and vision-language-action (VLA) applications. Instead of relying on generic graph runtimes such as PyTorch or TensorRT, the project rewrites model inference paths directly as C++/CUDA kernels. The approach targets overheads that batching cannot hide in real-time scenarios: fragmented small kernels, norm/residual/activation boundaries, quantize/dequantize operations, layout transitions, and Python/runtime scheduling. Initial results include ~44 ms for Pi0.5 on Jetson Thor and ~2.39 ms/token for Pi0-FAST on RTX 5090, and the project targets ~100 ms end to end for a Motus world model, down from a 1.3 s baseline. The work also notes that lower precision such as FP4 does not always yield significant speedups unless it is deeply fused and applied to large regions of the model; FP8 proved more consistently useful.
Key takeaway
AI architects and computer vision engineers optimizing real-time, small-batch ML models should critically evaluate whether generic graph runtimes are sufficient. For latency-sensitive workloads at batch size 1, consider rewriting inference paths directly as custom C++/CUDA kernels to reduce overhead from fragmented operations and memory-bandwidth pressure; lower precision alone may not solve the problem.
Key insights
In small-batch ML inference, bottlenecks often stem from runtime glue rather than GEMM alone, so fixing them calls for direct kernel-level optimization.
Principles
- Runtime overhead dominates small-batch inference.
- Lower precision is not an automatic performance win.
- Kernel boundaries prevent effective fusion.
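The first two principles can be connected with a little arithmetic. The sketch below applies Amdahl's law to a single inference step: a lower-precision format speeds up only the GEMM share of the latency, while the runtime glue is unchanged. The function name `end_to_end_speedup` and every number in the loop are illustrative assumptions, not measurements from FlashRT.

```python
def end_to_end_speedup(gemm_fraction: float, gemm_speedup: float) -> float:
    """Amdahl's law: only the GEMM share of the latency gets faster.

    gemm_fraction: share of baseline latency spent in GEMM (0..1).
    gemm_speedup:  how much faster GEMM runs in the lower-precision format.
    """
    if not 0.0 <= gemm_fraction <= 1.0:
        raise ValueError("gemm_fraction must be in [0, 1]")
    if gemm_speedup <= 0.0:
        raise ValueError("gemm_speedup must be positive")
    glue_fraction = 1.0 - gemm_fraction  # norms, quant/dequant, launches, ...
    return 1.0 / (glue_fraction + gemm_fraction / gemm_speedup)


# Hypothetical splits: GEMM made 2x faster, glue left untouched.
for fraction in (0.9, 0.5, 0.3):
    print(fraction, round(end_to_end_speedup(fraction, 2.0), 2))
# 0.9 1.82
# 0.5 1.33
# 0.3 1.18
```

Under these assumed splits, a 2x faster GEMM buys only about 1.2x end to end once glue makes up most of the latency, which is consistent with the observation that lower precision helps mainly when it is fused across large regions.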
Method
Rewrite model inference paths directly with C++/CUDA kernels to eliminate runtime glue overheads.
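To make the idea of removing a kernel boundary concrete, here is a minimal pure-Python model of a residual add followed by RMSNorm, written once as three separate "kernels" and once as a single fused pass. It is not FlashRT code: the function names are hypothetical, the `traffic` counter is a simplified count of off-chip element reads and writes, and the fused version assumes one row fits in registers or shared memory and that the intermediate sum does not need to be written back.

```python
import math


def unfused_residual_rmsnorm(x, residual, weight, eps=1e-6):
    """Three separate 'kernels'; every intermediate round-trips through memory."""
    n = len(x)
    traffic = 0
    # Kernel 1: residual add (read x, read residual, write h).
    h = [a + b for a, b in zip(x, residual)]
    traffic += 3 * n
    # Kernel 2: reduction (read h, write one scalar).
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in h) / n + eps)
    traffic += n + 1
    # Kernel 3: scale (read h, read weight, read the scalar, write y).
    y = [v * inv_rms * w for v, w in zip(h, weight)]
    traffic += 3 * n + 1
    return y, traffic


def fused_residual_rmsnorm(x, residual, weight, eps=1e-6):
    """One 'kernel'; h and inv_rms are assumed to stay on chip."""
    n = len(x)
    h = [a + b for a, b in zip(x, residual)]
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in h) / n + eps)
    y = [v * inv_rms * w for v, w in zip(h, weight)]
    traffic = 4 * n  # read x, residual, weight; write y
    return y, traffic


x = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.0, 1.0]
residual = [1.0] * 8
weight = [1.0] * 8
y_a, t_a = unfused_residual_rmsnorm(x, residual, weight)
y_b, t_b = fused_residual_rmsnorm(x, residual, weight)
print(y_a == y_b, t_a, t_b)  # True 58 32
```

Both versions produce identical outputs; the fused one simply moves fewer elements and would need one launch instead of three. In a real CUDA kernel the same structure would typically map to one thread block per row with a block-level reduction for the sum of squares.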
In practice
- Profile memory bandwidth for small-batch inference.
- Evaluate FP8 vs. FP4 for specific model regions.
- Consider custom kernels for real-time ML.
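As a starting point for the first two items, a back-of-the-envelope estimate can show why batch-1 layers tend to be limited by memory bandwidth. The sketch below models a single matrix-vector product: it computes the weight bytes, the arithmetic intensity (FLOPs per weight byte), and the lower bound on latency if the weights are streamed once at a given bandwidth. The helper names are hypothetical, activation and scale-factor bytes are ignored, and the 1000 GB/s figure is a placeholder, not a specification of Jetson Thor or RTX 5090.

```python
# Storage cost per weight element, ignoring scale factors and padding.
BYTES_PER_WEIGHT = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def gemv_profile(rows: int, cols: int, precision: str):
    """Return (weight_bytes, FLOPs per weight byte) for a batch-1 matvec."""
    weight_bytes = rows * cols * BYTES_PER_WEIGHT[precision]
    flops = 2 * rows * cols  # one multiply and one add per weight
    return weight_bytes, flops / weight_bytes


def min_latency_ms(weight_bytes: float, bandwidth_gb_s: float) -> float:
    """Lower bound on latency if every weight is read exactly once."""
    return weight_bytes / (bandwidth_gb_s * 1e9) * 1e3


def achieved_bandwidth_gb_s(bytes_moved: float, measured_ms: float) -> float:
    """Turn a profiled kernel time into effective bandwidth for comparison."""
    return bytes_moved / (measured_ms * 1e-3) / 1e9


ASSUMED_BANDWIDTH_GB_S = 1000.0  # placeholder, replace with a measured value
for precision in BYTES_PER_WEIGHT:
    weight_bytes, intensity = gemv_profile(4096, 4096, precision)
    print(precision, intensity, min_latency_ms(weight_bytes, ASSUMED_BANDWIDTH_GB_S))
```

At a few FLOPs per byte the layer sits far below the compute roof of a modern GPU, so halving the weight bytes helps only if the kernel actually reaches the bandwidth bound. Comparing `achieved_bandwidth_gb_s` for a profiled kernel against the device's measured peak is one way to check whether a region is worth converting to FP8 or FP4.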
Topics
- CUDA Kernels
- Small-Batch Inference
- Realtime ML
- FlashRT
- Runtime Overhead
Best for: AI Architect, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.