NVIDIA Blackwell Sets STAC-AI Record for LLM Inference in Finance

Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, FinTech & Digital Financial Services, Cloud Computing & IT Infrastructure · Depth: Advanced, medium

Summary

The Securities Technology Analysis Center (STAC) has introduced the STAC-AI benchmark suite, including the LANG6 (Inference-Only) benchmark, to evaluate end-to-end retrieval-augmented generation (RAG) and large language model (LLM) inference pipelines in financial trading. LANG6 runs the Llama 3.1 8B Instruct and Llama 3.1 70B Instruct models on custom EDGAR-based datasets (EDGAR4 for medium-length requests, EDGAR5 for long-context requests) to simulate financial summarization and analysis. It measures performance in both batch (offline) and interactive (online) modes, reporting throughput, reaction time (RT), and words per second per user (WPS/user). NVIDIA's Blackwell architecture, specifically the GB200 NVL72, demonstrated up to a 3.2x performance improvement over Hopper-based systems, with higher throughput and better interactivity across the tested scenarios, setting a new record for LLM inference in the financial sector.

Key takeaway

For CTOs and VPs of Engineering evaluating LLM inference infrastructure for financial applications, the NVIDIA Blackwell architecture, specifically the GB200 NVL72, offers up to a 3.2x performance improvement over Hopper on the STAC-AI LANG6 benchmark. Consider Blackwell for critical RAG pipelines that need higher throughput and better interactivity, especially demanding financial analysis and summarization workloads. Because benchmark results depend on the model, dataset, and request mix, use the TensorRT LLM Benchmarking Guide to validate performance on your own datasets.
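
As a rough illustration of what validating on your own data can involve, the sketch below times a single streamed response. It is not part of the TensorRT LLM tooling or of STAC's harness: `measure_stream`, its `clock` parameter, and the metric definitions (reaction time as the delay until the first chunk arrives, words per second measured over the streaming window) are assumptions made for this example.

```python
import time


def measure_stream(chunks, clock=time.perf_counter) -> dict:
    """Time one streamed response.

    `chunks` is any iterable that yields pieces of generated text,
    for example a streaming client wrapped in a generator (hypothetical).
    `clock` is injectable so the function can be tested deterministically.
    """
    start = clock()          # request is considered submitted here
    first_at = None          # arrival time of the first chunk
    last_at = None           # arrival time of the most recent chunk
    parts = []

    for chunk in chunks:
        last_at = clock()
        if first_at is None:
            first_at = last_at
        parts.append(chunk)

    # Count words on the joined text so words split across chunks count once.
    words = len("".join(parts).split())

    if first_at is None:
        # Nothing was generated, so no timing metric is defined.
        return {"words": 0, "reaction_time_s": None, "words_per_second": None}

    streaming_s = last_at - first_at
    return {
        "words": words,
        # Assumed definition: delay from submission to first output.
        "reaction_time_s": first_at - start,
        # Approximation: all words divided by the first-to-last chunk window.
        "words_per_second": words / streaming_s if streaming_s > 0 else None,
    }
```

Running this over a representative sample of your own prompts, on each candidate system, gives per-request numbers that can be compared directly instead of relying on published figures alone.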

Key insights

NVIDIA Blackwell architecture significantly improves LLM inference performance for financial RAG workloads, outperforming Hopper by up to 3.2x.

Method

The STAC-AI LANG6 benchmark evaluates LLM inference using Llama 3.1 models on EDGAR-based financial datasets in batch and interactive modes, measuring throughput, reaction time, and words per second per user.
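
The three metrics named above can be sketched from per-request timing records. This is a minimal illustration under stated assumptions, not STAC's official methodology: the `RequestRecord` schema is hypothetical, reaction time is assumed to be the delay from submission to first output, WPS/user is assumed to be a request's output words divided by its streaming time, and throughput is assumed to be total output words over the wall-clock span of the run.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestRecord:
    """Timing for one request (hypothetical schema, not STAC's format)."""
    submitted_s: float     # when the request was sent
    first_output_s: float  # when the first output arrived
    completed_s: float     # when the last output arrived
    words_out: int         # words in the generated response


def reaction_time(r: RequestRecord) -> float:
    # Assumed definition: delay between submission and first output.
    return r.first_output_s - r.submitted_s


def words_per_second_per_user(r: RequestRecord) -> float:
    # Assumed definition: output words over the time spent streaming them.
    streaming_s = r.completed_s - r.first_output_s
    return r.words_out / streaming_s if streaming_s > 0 else float("nan")


def throughput(records) -> float:
    # Assumed definition: total output words over the run's wall-clock span.
    start = min(r.submitted_s for r in records)
    end = max(r.completed_s for r in records)
    return sum(r.words_out for r in records) / (end - start)


def summarize(records) -> dict:
    return {
        "throughput_wps": throughput(records),
        "median_rt_s": median(reaction_time(r) for r in records),
        "median_wps_per_user": median(
            words_per_second_per_user(r) for r in records
        ),
    }
```

Throughput is a system-level number, while reaction time and WPS/user describe what an individual user experiences. That is why the two are reported separately: a configuration tuned for maximum throughput can still feel slow to each user.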

Best for: CTO, VP of Engineering/Data, Director of AI/ML, MLOps Engineer, AI Engineer, AI Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.