Achieving Single-Digit Microsecond Latency Inference for Capital Markets
Summary
NVIDIA's GH200 Grace Hopper Superchip, integrated into a Supermicro ARS-111GL-NHR server, has achieved single-digit microsecond latencies in the STAC-ML Markets (Inference) Tacana benchmark. This performance, audited by STAC, rivals or surpasses specialized hardware like FPGAs and ASICs, which are traditionally used in latency-sensitive algorithmic trading. The benchmark measures LSTM model latency for time series forecasting, using three models (LSTM_A, LSTM_B, LSTM_C) of varying complexity. NVIDIA reported 99th percentile latencies as low as 10 microseconds for LSTM_B and 70 microseconds for LSTM_A, demonstrating consistent performance across multiple model instances. The results highlight the viability of general-purpose GPUs for high-speed financial applications, offering a cost-effective alternative to custom hardware.
Key takeaway
For AI engineers building high-frequency trading systems, these results suggest that NVIDIA GPUs can deliver competitive latency without the significant investment that specialized hardware requires. Explore the open-source dl-lowlat-infer repository, and consider NVIDIA's Blackwell or Hopper architectures when implementing and optimizing LSTM models for low-latency financial applications.
Key insights
GPUs can achieve single-digit microsecond latencies for deep learning inference in high-frequency trading.
Principles
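- Persistent kernels reduce latency by staying resident on the GPU with weights preloaded, so each inference avoids kernel-launch and weight-loading overhead.
- Green contexts partition GPU resources so that several model instances can be served efficiently on one GPU.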
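The persistent-kernel principle can be pictured with a host-side analogy. The sketch below is not GPU code and is not taken from dl-lowlat-infer: `PersistentWorker`, its sequence counters, and the dot-product "model" are hypothetical names chosen for illustration. It only shows the shape of the idea: a worker is started once, holds its weights for its whole lifetime, and polls a flag for new input instead of being launched per request.

```python
import threading
import time


class PersistentWorker:
    """Host-side analogy of a persistent kernel (illustrative, not GPU code)."""

    def __init__(self, weights):
        self.weights = list(weights)  # loaded once, reused for every request
        self.request_seq = 0          # bumped by the caller when input is ready
        self.done_seq = 0             # bumped by the worker when output is ready
        self.input = None
        self.output = None
        self.running = True
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.thread.start()           # "launched" exactly once

    def _loop(self):
        seen = 0
        while self.running:
            if self.request_seq == seen:
                time.sleep(0)         # yield; a GPU kernel would spin on a flag
                continue
            seen = self.request_seq
            # A dot product stands in for the model's forward pass.
            self.output = sum(w * x for w, x in zip(self.weights, self.input))
            self.done_seq = seen      # publish the result

    def infer(self, x):
        self.input = x                # write the payload first...
        self.request_seq += 1         # ...then signal the worker
        target = self.request_seq
        while self.done_seq != target:
            time.sleep(0)             # poll until the worker publishes
        return self.output

    def stop(self):
        self.running = False
        self.thread.join()


worker = PersistentWorker([0.5, -1.0, 2.0])
results = [worker.infer([1.0, 2.0, 3.0]), worker.infer([2.0, 0.0, 1.0])]
worker.stop()
```

Both calls reuse the same running worker and the same preloaded weights; nothing is started or loaded on the request path. On a GPU the flags would live in memory visible to both CPU and GPU and be updated with atomic operations, which this Python sketch only imitates.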
Method
Low-latency LSTM inference on GPUs involves a precomputation step followed by a single-kernel inference for the last time step, using persistent kernels and CPU/GPU atomic synchronization.
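To make the split concrete, here is a minimal pure-Python sketch. It assumes, since the summary does not spell this out, that every time step except the newest is available before the latest input arrives, so the recurrent state can be computed ahead of time and only one cell step remains on the critical path. The single-layer cell, the gate order, and the sizes `D`, `H`, and `T` are illustrative choices, not the benchmark's models.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step. W: 4H x D, U: 4H x H, b: 4H; gate order i, f, g, o."""
    H = len(h)
    z = [
        sum(W[r][k] * x[k] for k in range(len(x)))
        + sum(U[r][k] * h[k] for k in range(H))
        + b[r]
        for r in range(4 * H)
    ]
    i = [sigmoid(v) for v in z[0:H]]
    f = [sigmoid(v) for v in z[H:2 * H]]
    g = [math.tanh(v) for v in z[2 * H:3 * H]]
    o = [sigmoid(v) for v in z[3 * H:4 * H]]
    c_new = [f[j] * c[j] + i[j] * g[j] for j in range(H)]
    h_new = [o[j] * math.tanh(c_new[j]) for j in range(H)]
    return h_new, c_new


def run(window, h, c, params):
    """Apply the cell to each time step in order and return the final state."""
    for x in window:
        h, c = lstm_step(x, h, c, *params)
    return h, c


rng = random.Random(0)
D, H, T = 3, 4, 8  # hypothetical feature size, hidden size, window length
params = (
    [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(4 * H)],
    [[rng.uniform(-0.5, 0.5) for _ in range(H)] for _ in range(4 * H)],
    [0.0] * (4 * H),
)
window = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
zeros = [0.0] * H

# Reference: process the whole window after the last input arrives.
h_full, _ = run(window, zeros, zeros, params)

# Precomputation: steps 0..T-2 are processed off the critical path.
h_pre, c_pre = run(window[:-1], zeros, zeros, params)
# Critical path: a single cell step once the last time step arrives.
h_fast, _ = lstm_step(window[-1], h_pre, c_pre, *params)
```

`h_fast` matches `h_full` because the split performs the same operations in the same order; only the timing of the work changes. In the GPU setting described above, that final step is what the single persistent kernel would execute.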
In practice
- Use NVIDIA's dl-lowlat-infer for low-latency time series inference.
- Target Blackwell or Hopper architectures for optimal performance.
- Use GDRCopy to reduce CPU-GPU synchronization overhead.
Topics
- NVIDIA GH200 Grace Hopper Superchip
- STAC-ML Markets Benchmark
- Low-Latency GPU Inference
- Algorithmic Trading
- LSTM Neural Networks
Best for: Machine Learning Engineer, AI Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.