Rewriting model inference with CUDA kernels: the bottleneck was not just GEMM [P]

2026-05-18 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Advanced, short

Summary

FlashRT is a new CUDA-first inference runtime, still under development, aimed at latency bottlenecks in small-batch, real-time machine learning workloads, particularly robotics and vision-language-action (VLA) models. Rather than relying on general-purpose frameworks and graph runtimes such as PyTorch or TensorRT, the project rewrites model inference paths directly as C++/CUDA kernels. The targets are overheads that batching cannot hide in real-time scenarios: fragmented small kernels, norm/residual/activation boundaries, quantize/dequantize operations, layout transitions, and Python/runtime scheduling. Initial reported results include ~44 ms for Pi0.5 on Jetson Thor and ~2.39 ms/token for Pi0-FAST on RTX 5090, with a target of ~100 ms end to end for a Motus world model, down from a 1.3 s baseline. The work also reports that lower precision is not an automatic win: FP4 yields significant speedups only when it is deeply fused and applied to large regions of the model, while FP8 has proven more consistently useful.

Key takeaway

AI Architects and Computer Vision Engineers optimizing real-time, small-batch ML models should critically evaluate whether generic graph runtimes are sufficient. If your workloads are latency-sensitive and run at batch size 1, consider rewriting inference paths as custom C++/CUDA kernels to reduce the overhead of fragmented operations and memory-bandwidth limits. Lower precision alone may not solve the problem.

Key insights

In small-batch ML inference, the bottleneck is often the runtime glue around GEMM rather than GEMM alone, so addressing it requires direct kernel-level optimization.

Principles

Runtime overhead dominates small-batch inference.
Lower precision is not an automatic performance win.
Kernel boundaries at norm, residual, and activation steps block fusion and add overhead.
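
To make the kernel-boundary principle concrete, here is a minimal CPU-side sketch in Python. It is not FlashRT code. It models a residual add followed by RMSNorm (one plausible example of a norm/residual boundary) and counts element reads and writes to arrays that would live in GPU global memory. The function names and the traffic-counting model are assumptions made for illustration.

```python
import math


def unfused_add_rmsnorm(x, r, weight, eps=1e-6):
    """Residual add and RMSNorm as two separate steps (two "kernels").

    Returns (y, traffic), where traffic counts element reads and writes
    to arrays that would sit in global memory on a GPU.
    """
    n = len(x)
    # Step 1: residual add. The intermediate h is written out to memory.
    h = [x[i] + r[i] for i in range(n)]
    traffic = 2 * n + n  # read x and r, write h

    # Step 2: RMSNorm. h has to be read back before it can be normalized.
    mean_sq = sum(v * v for v in h) / n
    scale = 1.0 / math.sqrt(mean_sq + eps)
    y = [h[i] * scale * weight[i] for i in range(n)]
    # Read h (counted once, assuming the row is cached), read weight, write y.
    traffic += n + n + n
    return y, traffic


def fused_add_rmsnorm(x, r, weight, eps=1e-6):
    """Same math in one step; h is modeled as never leaving the chip."""
    n = len(x)
    h = [x[i] + r[i] for i in range(n)]  # stands in for registers/shared memory
    mean_sq = sum(v * v for v in h) / n
    scale = 1.0 / math.sqrt(mean_sq + eps)
    y = [h[i] * scale * weight[i] for i in range(n)]
    traffic = 2 * n + n + n  # read x and r, read weight, write y
    return y, traffic
```

Under this model the unfused path moves 6n elements across two launches, while the fused path moves 4n in one launch, and both return the same values. A real fused kernel may still have to write h if later layers consume the residual stream, which would change the count.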

Method

Rewrite model inference paths directly with C++/CUDA kernels to eliminate runtime glue overheads.
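
A rough way to see why cutting the number of kernels matters at batch size 1 is a simple additive cost model. The sketch below assumes every kernel pays a fixed overhead (launch, scheduling, boundary work) that is not overlapped with compute. Real runtimes can overlap or amortize part of this, so treat it as a pessimistic approximation. The function names and all numbers are hypothetical and are not FlashRT measurements.

```python
def step_latency_us(num_kernels, overhead_per_kernel_us, compute_us):
    """Latency of one inference step if fixed per-kernel costs are serialized."""
    if num_kernels < 0 or overhead_per_kernel_us < 0 or compute_us < 0:
        raise ValueError("inputs must be non-negative")
    return num_kernels * overhead_per_kernel_us + compute_us


def overhead_share(num_kernels, overhead_per_kernel_us, compute_us):
    """Fraction of the step spent on fixed per-kernel overhead."""
    total = step_latency_us(num_kernels, overhead_per_kernel_us, compute_us)
    if total == 0:
        return 0.0
    return (num_kernels * overhead_per_kernel_us) / total


# Hypothetical numbers for illustration only.
before = step_latency_us(400, 5.0, 3000.0)  # many small kernels
after = step_latency_us(80, 5.0, 3000.0)    # same math, fewer fused kernels
print(before, after, overhead_share(400, 5.0, 3000.0))
```

In this model the useful compute is unchanged, and the whole saving comes from removing fixed per-kernel cost. A larger batch would shrink that fixed share by growing the compute term, which is why the overhead stays visible at batch size 1.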

In practice

Profile memory bandwidth for small-batch inference.
Evaluate FP8 vs. FP4 for specific model regions.
Consider custom kernels for real-time ML.
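
Two back-of-envelope estimates can help with the first two checks above before any kernel is written. The sketch below assumes an Amdahl's-law view of precision changes (only part of the runtime gets faster) and a bandwidth floor for reading the weights once per step. The function names, parameter names, and any numbers you plug in are illustrative assumptions, not figures from the FlashRT project.

```python
def end_to_end_speedup(fraction_accelerated, region_speedup):
    """Amdahl's law: overall speedup when only part of the runtime gets faster.

    fraction_accelerated: share of baseline time spent in the region that
        moves to lower precision (0.0 to 1.0).
    region_speedup: how much faster that region alone becomes (> 0).
    """
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must be in [0, 1]")
    if region_speedup <= 0.0:
        raise ValueError("region_speedup must be positive")
    remaining = 1.0 - fraction_accelerated
    return 1.0 / (remaining + fraction_accelerated / region_speedup)


def weight_read_floor_ms(num_params, bytes_per_param, bandwidth_gb_per_s):
    """Lower bound on step time if every weight is read once from memory.

    Uses 1 GB/s = 1e9 bytes/s. Ignores activations, caches, and compute.
    """
    if bandwidth_gb_per_s <= 0.0:
        raise ValueError("bandwidth_gb_per_s must be positive")
    bytes_moved = num_params * bytes_per_param
    return bytes_moved / (bandwidth_gb_per_s * 1e9) * 1e3
```

For example, `end_to_end_speedup(0.3, 2.0)` is about 1.18: doubling the speed of a region that takes 30% of the step barely moves the total, which matches the observation that a lower-precision format pays off only when it covers a large, fused region. Comparing `weight_read_floor_ms` at 1 byte per parameter (FP8) and 0.5 bytes (FP4) against a profiled step time shows how much of the gap is bandwidth and how much is overhead elsewhere.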

Topics

CUDA Kernels
Small-Batch Inference
Realtime ML
FlashRT
Runtime Overhead

Code references

LiangSu8899/FlashRT

Best for: AI Architect, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.