NVIDIA Blackwell Sets STAC-AI Record for LLM Inference in Finance

2026-03-05 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, FinTech & Digital Financial Services, Cloud Computing & IT Infrastructure · Depth: Advanced, medium

Summary

The Strategic Technology Analysis Center (STAC) has introduced the STAC-AI benchmark; its LANG6 (Inference-Only) variant evaluates end-to-end retrieval-augmented generation (RAG) and large language model (LLM) inference pipelines in financial trading. The benchmark runs the Llama 3.1 8B Instruct and Llama 3.1 70B Instruct models on custom EDGAR-based datasets (EDGAR4 for medium-length requests and EDGAR5 for long-context requests) to simulate financial summarization and analysis. It assesses performance in both batch (offline) and interactive (online) modes, reporting throughput, reaction time (RT), and words per second per user (WPS/user). NVIDIA's Blackwell architecture, specifically the GB200 NVL72, delivered up to a 3.2x performance improvement over Hopper-based systems, with higher throughput and better interactivity across the tested scenarios, setting a new record for LLM inference in the financial sector.

Key takeaway

For CTOs and VPs of Engineering evaluating LLM inference infrastructure for financial applications, the NVIDIA Blackwell architecture, specifically the GB200 NVL72, offers substantial performance gains (up to 3.2x) over Hopper. You should consider upgrading to Blackwell for critical RAG pipelines to achieve superior throughput and interactivity, especially for demanding financial analysis and summarization tasks. Explore the TensorRT LLM Benchmarking Guide to validate performance with your specific datasets.

Key insights

NVIDIA Blackwell architecture significantly improves LLM inference performance for financial RAG workloads, outperforming Hopper by up to 3.2x.

Principles

Quantization to FP8 (on Hopper) or NVFP4 (on Blackwell) reduces the memory and compute cost of LLM inference.
Server-side tokenization protects system prompts but adds CPU load.
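
To make the quantization principle concrete, here is a rough back-of-the-envelope sketch of weight storage at different precisions. It is an illustration under stated assumptions, not a figure from the benchmark: the parameter counts are the nominal "8B" and "70B" from the model names, the 16-bit baseline is included only for comparison, and the overhead of quantization scale factors, activations, and the KV cache is ignored.

```python
# Nominal parameter counts read off the model names (assumption: "8B" = 8e9, "70B" = 70e9).
PARAMS = {"Llama 3.1 8B": 8e9, "Llama 3.1 70B": 70e9}

# Bits stored per weight. Scale-factor overhead for FP8 and NVFP4 is ignored,
# so real checkpoints will be somewhat larger than these estimates.
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}


def weight_gigabytes(num_params: float, bits: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    for model, n in PARAMS.items():
        for fmt, bits in BITS_PER_WEIGHT.items():
            print(f"{model:14s} {fmt:6s} ~{weight_gigabytes(n, bits):6.1f} GB")
```

Halving the bits per weight halves the bytes that must be read from memory for each generated token, which is one reason lower-precision formats tend to help inference throughput.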

Method

The STAC-AI LANG6 benchmark evaluates LLM inference using Llama 3.1 models on EDGAR-based financial datasets in batch and interactive modes, measuring throughput, reaction time, and words per second per user.
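
The three metrics can be sketched from per-request timestamps as shown below. The definitions are simplified assumptions for illustration, not STAC's official specification: reaction time is taken as the delay from submission to first output, WPS/user as output words divided by the time spent streaming the response, and throughput as total output words over the wall-clock span of the run. The RequestRecord type and its field names are hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestRecord:
    """Hypothetical per-request timing record (all times in seconds)."""
    submitted: float      # when the client sent the request
    first_output: float   # when the first piece of output arrived
    finished: float       # when the last piece of output arrived
    output_words: int     # number of words in the generated response


def reaction_time(r: RequestRecord) -> float:
    """Assumed definition: delay between submission and first output."""
    return r.first_output - r.submitted


def words_per_second_per_user(r: RequestRecord) -> float:
    """Assumed definition: output words divided by streaming duration."""
    duration = r.finished - r.first_output
    if duration <= 0:
        raise ValueError("response must take a positive amount of time")
    return r.output_words / duration


def throughput(records: list[RequestRecord]) -> float:
    """Assumed definition: total output words over the wall-clock span."""
    start = min(r.submitted for r in records)
    end = max(r.finished for r in records)
    return sum(r.output_words for r in records) / (end - start)


if __name__ == "__main__":
    # Two made-up requests, purely to exercise the functions.
    records = [
        RequestRecord(0.0, 0.5, 4.5, 200),
        RequestRecord(1.0, 1.25, 6.25, 300),
    ]
    print("median RT (s):", median(reaction_time(r) for r in records))
    print("WPS/user:", [words_per_second_per_user(r) for r in records])
    print("throughput (words/s):", throughput(records))
```

In batch mode the aggregate throughput is usually the figure of interest, while in interactive mode the per-request reaction time and WPS/user describe what each user experiences.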

In practice

Use NVIDIA TensorRT LLM for efficient model execution.
Benchmark custom models with TensorRT LLM using Docker and Model Optimizer.
Generate synthetic datasets to simulate specific token distributions.
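
As a rough illustration of the last point, the sketch below generates synthetic requests whose input and output lengths follow a chosen distribution. It does not reproduce the TensorRT LLM tooling or its dataset format: the JSON field names, the clamped normal distribution, and the placeholder vocabulary size are all assumptions made for the example.

```python
import json
import random


def sample_length(rng: random.Random, mean: float, stdev: float,
                  low: int, high: int) -> int:
    """Draw a length from a normal distribution, clamped to [low, high]."""
    return max(low, min(high, round(rng.gauss(mean, stdev))))


def make_requests(n, in_mean, in_stdev, out_mean, out_stdev,
                  vocab_size=32000, max_len=8192, seed=0):
    """Build n synthetic requests with random token IDs.

    vocab_size is a placeholder; use the real tokenizer's vocabulary size.
    The dictionary keys are a hypothetical schema, not a tool's required format.
    """
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    requests = []
    for i in range(n):
        in_len = sample_length(rng, in_mean, in_stdev, 1, max_len)
        out_len = sample_length(rng, out_mean, out_stdev, 1, max_len)
        requests.append({
            "id": i,
            "input_ids": [rng.randrange(vocab_size) for _ in range(in_len)],
            "output_tokens": out_len,
        })
    return requests


def write_jsonl(requests, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for req in requests:
            f.write(json.dumps(req) + "\n")


if __name__ == "__main__":
    # Example: medium-length prompts with short summaries (made-up parameters).
    reqs = make_requests(100, in_mean=2048, in_stdev=256,
                         out_mean=128, out_stdev=32)
    write_jsonl(reqs, "synthetic_requests.jsonl")
    print(len(reqs), "requests written")
```

Matching the length distribution of production traffic matters more than the token contents here, since prompt and response lengths largely determine the prefill and decode cost being measured.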

Topics

LLM Inference Benchmarking
NVIDIA Blackwell Architecture
Financial LLMs
TensorRT LLM
Retrieval-Augmented Generation

Best for: CTO, VP of Engineering/Data, Director of AI/ML, MLOps Engineer, AI Engineer, AI Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.