Achieving Single-Digit Microsecond Latency Inference for Capital Markets

2026-04-02 · Source: NVIDIA Technical Blog · Field: Finance & Economics — Capital Markets & Investment Management, FinTech & Digital Financial Services · Depth: Advanced, long

Summary

NVIDIA's GH200 Grace Hopper Superchip, integrated into a Supermicro ARS-111GL-NHR server, has achieved single-digit microsecond latencies in the STAC-ML Markets (Inference) Tacana benchmark. This performance, audited by STAC, rivals or surpasses specialized hardware like FPGAs and ASICs, which are traditionally used in latency-sensitive algorithmic trading. The benchmark measures LSTM model latency for time series forecasting, using three models (LSTM_A, LSTM_B, LSTM_C) of varying complexity. NVIDIA reported 99th percentile latencies as low as 10 microseconds for LSTM_B and 70 microseconds for LSTM_A, demonstrating consistent performance across multiple model instances. The results highlight the viability of general-purpose GPUs for high-speed financial applications, offering a cost-effective alternative to custom hardware.

Key takeaway

If you are an AI engineer building high-frequency trading systems, the demonstrated single-digit microsecond latencies on NVIDIA GPUs mean you can reach competitive performance without a significant investment in specialized hardware. Explore the open-source dl-lowlat-infer repository and consider NVIDIA's Blackwell or Hopper architectures when implementing and optimizing LSTM models for latency-critical financial applications.

Key insights

GPUs can achieve single-digit microsecond latencies for deep learning inference in high-frequency trading.

Principles

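Persistent kernels reduce latency by staying resident on the GPU with weights preloaded, which avoids relaunching a kernel and reloading weights for each request.
Green contexts partition a GPU's streaming multiprocessors so that several model instances can be served concurrently on one GPU.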
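
The persistent-kernel principle can be illustrated with a CPU-only analogy. The sketch below is not CUDA and is not taken from dl-lowlat-infer: it swaps in a long-lived Python thread for the resident kernel and two plain counters for the CPU/GPU flags. The class name PersistentWorker, the toy dot-product "model", and the single-caller assumption are all hypothetical and chosen only for illustration.

```python
import threading
import time


class PersistentWorker:
    """CPU-only analogy of a persistent kernel: started once, weights loaded once,
    then it polls a request counter instead of being relaunched per request.
    Assumes a single caller thread."""

    def __init__(self, weights):
        self._weights = list(weights)  # "preloaded" a single time at startup
        self._request_seq = 0          # bumped by the caller to publish work
        self._response_seq = 0         # bumped by the worker to publish a result
        self._input = None
        self._output = None
        self._running = True
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        seen = 0
        while self._running:
            if self._request_seq == seen:
                time.sleep(0)  # yield to other threads; a GPU kernel would just keep polling
                continue
            seen = self._request_seq
            # Toy "model": a dot product against the resident weights.
            self._output = sum(w * x for w, x in zip(self._weights, self._input))
            self._response_seq = seen  # signal completion to the caller

    def infer(self, features):
        self._input = list(features)
        self._request_seq += 1  # hand the request to the resident worker
        while self._response_seq != self._request_seq:
            time.sleep(0)  # caller waits on the completion counter
        return self._output

    def stop(self):
        self._running = False
        self._thread.join()


if __name__ == "__main__":
    worker = PersistentWorker([1.0, 2.0, 3.0])
    print(worker.infer([1.0, 1.0, 1.0]))  # 6.0
    worker.stop()
```

The point of the pattern is that per-request work shrinks to writing an input, bumping a counter, and waiting on another counter. On real hardware the counters would be atomics in memory visible to both CPU and GPU rather than Python attributes.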

Method

Low-latency LSTM inference on GPUs involves a precomputation step followed by a single-kernel inference for the last time step, using persistent kernels and CPU/GPU atomic synchronization.
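
One reading of this method, assuming every time step in the input window except the newest is known before the final tick arrives, is that the LSTM state for the earlier steps can be computed off the critical path, leaving a single cell update to run when the last tick lands. The pure-Python sketch below illustrates that split on a toy LSTM. The function names, sizes, and random parameters are hypothetical, and nothing here reproduces the GPU kernel from the original article.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_params(input_size, hidden_size, seed=0):
    """Random toy parameters for the four LSTM gates (i, f, g, o)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        gate: {
            "W": mat(hidden_size, input_size),   # input weights
            "U": mat(hidden_size, hidden_size),  # recurrent weights
            "b": [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)],
        }
        for gate in "ifgo"
    }


def lstm_step(x, h, c, params):
    """Advance the cell by one time step and return the new (h, c)."""
    def pre(gate):
        p = params[gate]
        return [
            sum(w * xi for w, xi in zip(p["W"][j], x))
            + sum(u * hi for u, hi in zip(p["U"][j], h))
            + p["b"][j]
            for j in range(len(h))
        ]

    i = [sigmoid(v) for v in pre("i")]
    f = [sigmoid(v) for v in pre("f")]
    g = [math.tanh(v) for v in pre("g")]
    o = [sigmoid(v) for v in pre("o")]
    c_new = [f[j] * c[j] + i[j] * g[j] for j in range(len(h))]
    h_new = [o[j] * math.tanh(c_new[j]) for j in range(len(h))]
    return h_new, c_new


def full_pass(window, params, hidden_size):
    """Baseline: run every time step of the window when the last tick arrives."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    for x in window:
        h, c = lstm_step(x, h, c, params)
    return h


def precompute(prefix, params, hidden_size):
    """Off the critical path: consume every time step except the last."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    for x in prefix:
        h, c = lstm_step(x, h, c, params)
    return h, c


def infer_last_step(x_last, state, params):
    """On the critical path: a single cell update once the newest tick arrives."""
    h, c = state
    h, _ = lstm_step(x_last, h, c, params)
    return h
```

Calling precompute on window[:-1] and then infer_last_step on window[-1] performs the same arithmetic in the same order as full_pass, so the two results match exactly. Only the final cell update sits between the arrival of the tick and the prediction.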

In practice

Use NVIDIA's dl-lowlat-infer for low-latency time series inference.
Target Blackwell or Hopper architectures for optimal performance.
Employ GDRCopy for reduced CPU-GPU synchronization overhead.

Topics

NVIDIA GH200 Grace Hopper Superchip
STAC-ML Markets Benchmark
Low-Latency GPU Inference
Algorithmic Trading
LSTM Neural Networks

Best for: Machine Learning Engineer, AI Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.