NVIDIA Blackwell Sets STAC-AI Record for LLM Inference in Finance
Summary
The Strategic Technology Analysis Center (STAC) has introduced the STAC-AI LANG6 (Inference-Only) benchmark to evaluate end-to-end retrieval-augmented generation (RAG) and large language model (LLM) inference pipelines in financial trading. The benchmark runs the Llama 3.1 8B Instruct and Llama 3.1 70B Instruct models on custom EDGAR-based datasets (EDGAR4 for medium-length requests, EDGAR5 for long-context requests) to simulate financial summarization and analysis. It covers both batch (offline) and interactive (online) modes and reports throughput, reaction time (RT), and words per second per user (WPS/user). On this benchmark, NVIDIA's Blackwell architecture, in particular the GB200 NVL72, showed up to a 3.2x performance improvement over Hopper-based systems, with higher throughput and better interactivity across the tested scenarios, setting a new record for LLM inference in the financial sector.
Key takeaway
For CTOs and VPs of Engineering evaluating LLM inference infrastructure for financial applications, the NVIDIA Blackwell architecture, specifically the GB200 NVL72, offers substantial performance gains (up to 3.2x) over Hopper. You should consider upgrading to Blackwell for critical RAG pipelines to achieve superior throughput and interactivity, especially for demanding financial analysis and summarization tasks. Explore the TensorRT LLM Benchmarking Guide to validate performance with your specific datasets.
Key insights
NVIDIA Blackwell architecture significantly improves LLM inference performance for financial RAG workloads, outperforming Hopper by up to 3.2x.
Principles
- Quantizing weights to FP8 (Hopper) or NVFP4 (Blackwell) reduces the memory and compute cost of LLM inference.
- Server-side tokenization protects system prompts but adds CPU load.
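To make the quantization principle concrete, here is a rough back-of-the-envelope sketch of weight storage at different precisions. It is an illustration, not a figure from the benchmark: the parameter counts are the nominal 8B and 70B, the bit widths are nominal, and the extra storage for scaling factors, the KV cache, and activations is ignored.

```python
# Nominal bits per weight. Scaling-factor metadata is ignored (assumption),
# so real quantized checkpoints are somewhat larger than these estimates.
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "NVFP4": 4}


def weight_gigabytes(num_params: float, precision: str) -> float:
    """Approximate weight storage in GB (10^9 bytes) at a given precision."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9


if __name__ == "__main__":
    # Nominal parameter counts (assumption); actual counts differ slightly.
    models = (("Llama 3.1 8B", 8e9), ("Llama 3.1 70B", 70e9))
    for name, params in models:
        for precision in BITS_PER_WEIGHT:
            size = weight_gigabytes(params, precision)
            print(f"{name} at {precision}: ~{size:.0f} GB of weights")
```

Halving the bits per weight halves the weight footprint, which leaves more accelerator memory for the KV cache and larger batches.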
Method
The STAC-AI LANG6 benchmark evaluates LLM inference using Llama 3.1 models on EDGAR-based financial datasets in batch and interactive modes, measuring throughput, reaction time, and words per second per user.
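The three metrics can be sketched from per-request timestamps as follows. This is a minimal illustration under assumed definitions, not the STAC specification: reaction time is taken as the delay from submission to the first output word, WPS/user as output words divided by the time between the first and last word, and throughput as total output words over the wall-clock span of the run. The `RequestTiming` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestTiming:
    """Hypothetical per-request record; all times are in seconds."""
    submitted: float     # when the request was sent
    first_output: float  # when the first output word arrived
    completed: float     # when the last output word arrived
    output_words: int    # number of words in the response


def reaction_time(r: RequestTiming) -> float:
    """Assumed definition: delay between submission and first output."""
    return r.first_output - r.submitted


def words_per_second_per_user(r: RequestTiming) -> float:
    """Assumed definition: output rate one user sees once output starts."""
    duration = r.completed - r.first_output
    return r.output_words / duration if duration > 0 else float("nan")


def throughput(records: Sequence[RequestTiming]) -> float:
    """Assumed definition: total output words per second over the whole run."""
    start = min(r.submitted for r in records)
    end = max(r.completed for r in records)
    return sum(r.output_words for r in records) / (end - start)
```

Throughput describes the system as a whole, while reaction time and WPS/user describe what an individual user experiences, which is why batch and interactive modes are reported separately.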
In practice
- Use NVIDIA TensorRT LLM for efficient model execution.
- Benchmark custom models with TensorRT LLM using Docker and Model Optimizer.
- Generate synthetic datasets to simulate specific token distributions.
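As a rough illustration of the last point, the sketch below draws input and output lengths from normal distributions and emits one JSON record per request. It uses only the Python standard library. The record layout (`task_id`, `input_ids`, `output_tokens`) and the `vocab_size` default are assumptions for illustration, so check the TensorRT LLM Benchmarking Guide for the dataset format your tooling expects.

```python
import json
import random


def sample_length(rng: random.Random, mean: float, stdev: float) -> int:
    """Draw a token count from a normal distribution, clamped to at least 1."""
    return max(1, round(rng.gauss(mean, stdev)))


def make_synthetic_requests(num_requests, input_mean, input_stdev,
                            output_mean, output_stdev,
                            vocab_size=1000, seed=0):
    """Build requests with random token IDs and sampled lengths.

    vocab_size is a placeholder (assumption); set it to the vocabulary
    size of your tokenizer. The field names are hypothetical.
    """
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    requests = []
    for task_id in range(num_requests):
        input_len = sample_length(rng, input_mean, input_stdev)
        requests.append({
            "task_id": task_id,
            "input_ids": [rng.randrange(vocab_size) for _ in range(input_len)],
            "output_tokens": sample_length(rng, output_mean, output_stdev),
        })
    return requests


def to_jsonl(requests) -> str:
    """Serialize requests as JSON Lines, one request per line."""
    return "\n".join(json.dumps(r) for r in requests)


if __name__ == "__main__":
    dataset = make_synthetic_requests(3, input_mean=50, input_stdev=10,
                                      output_mean=20, output_stdev=5)
    print(to_jsonl(dataset))
```

Random token IDs reproduce the length distribution of a workload but not its content, so results from such a dataset indicate throughput and latency behavior rather than output quality.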
Topics
- LLM Inference Benchmarking
- NVIDIA Blackwell Architecture
- Financial LLMs
- TensorRT LLM
- Retrieval-Augmented Generation
Best for: CTO, VP of Engineering/Data, Director of AI/ML, MLOps Engineer, AI Engineer, AI Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.