Making Softmax More Efficient with NVIDIA Blackwell Ultra

2026-02-25 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

The NVIDIA Blackwell Ultra architecture improves Large Language Model (LLM) inference performance by doubling the throughput of the Special Function Units (SFUs) for transcendental calculations, specifically the base-2 exponential instruction (`MUFU.EX2`) that softmax kernels use to evaluate exponentials. This targets a bottleneck in attention mechanisms, where the SFUs previously left Tensor Cores stalled while attention scores were being normalized. The article details how the hardware change reduces softmax latency, letting Tensor Cores stay better utilized and speeding up overall forward propagation (FPROP). Benchmarks show Blackwell Ultra (GB300) achieving approximately 2x the `MUFU.EX2` throughput of standard Blackwell (GB200) across various data types, and a ~35% increase in end-to-end FPROP throughput for FP8 operations in models like DeepSeek-V3.

Key takeaway

For AI Engineers and Deep Learning Architects optimizing LLM inference, the cost of non-linear operations matters as much as matrix-multiply throughput. Blackwell Ultra's doubled SFU throughput directly addresses the softmax bottleneck, yielding a roughly 35% FPROP throughput gain in FP8. Consider Blackwell Ultra systems for models with complex attention schemes and long context windows, where keeping Tensor Cores and SFUs in balance has the largest effect on throughput.

Key insights

Doubling SFU throughput in Blackwell Ultra significantly accelerates LLM attention mechanisms by resolving softmax bottlenecks.

Principles

Softmax is a critical bottleneck in long-context AI.
Balanced hardware throughput is essential for inference.
Attention mechanisms dynamically re-weight token information.
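
Why balance matters can be sketched with Amdahl's law. The estimate below assumes, purely for illustration, that softmax and the matrix multiplies run back to back with no overlap; real attention kernels overlap them, so treat the numbers as a rough intuition and not a prediction. The function name `overall_speedup` and the sample fractions are hypothetical and do not come from the article.

```python
def overall_speedup(softmax_fraction: float, softmax_speedup: float = 2.0) -> float:
    """Amdahl's-law estimate of end-to-end speedup.

    softmax_fraction: share of attention time spent in softmax (assumed value).
    softmax_speedup:  factor by which the softmax portion gets faster
                      (2.0 mirrors the doubled SFU throughput).
    Assumes fully serialized execution, which real kernels do not have.
    """
    if not 0.0 <= softmax_fraction <= 1.0:
        raise ValueError("softmax_fraction must be between 0 and 1")
    if softmax_speedup <= 0.0:
        raise ValueError("softmax_speedup must be positive")
    # Time left after the speedup, relative to an original time of 1.0.
    remaining = (1.0 - softmax_fraction) + softmax_fraction / softmax_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Hypothetical softmax shares; none of these are measured values.
    for fraction in (0.1, 0.3, 0.5):
        print(f"softmax share {fraction:.0%}: {overall_speedup(fraction):.3f}x")
```

The larger the share of time spent in softmax, the more a faster SFU pays off, which is why long-context attention benefits most.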

Method

The Blackwell Ultra architecture doubles SFU throughput for `MUFU.EX2` instructions, reducing softmax execution time and minimizing Tensor Core idle periods within the attention loop's BMM1-Softmax-BMM2 pipeline.
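
The pipeline above can be sketched for a single query in plain Python. This is a minimal illustration and not NVIDIA's kernel: it assumes standard scaled dot-product attention, and the names `softmax_exp2` and `attend` are hypothetical. It rewrites the natural exponential as a base-2 exponential, e^x = 2^(x * log2(e)), which is the form a base-2 instruction such as `MUFU.EX2` evaluates.

```python
import math

# Constant that converts a natural exponent into a base-2 exponent.
LOG2_E = math.log2(math.e)


def softmax_exp2(scores):
    """Numerically stable softmax using only base-2 exponentials."""
    peak = max(scores)  # subtract the max so every exponent is <= 0
    exps = [2.0 ** ((s - peak) * LOG2_E) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query attention: BMM1, then softmax, then BMM2."""
    scale = 1.0 / math.sqrt(len(query))
    # BMM1: scaled dot product of the query with every key.
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax: the exponential-heavy step between the two matrix multiplies.
    weights = softmax_exp2(scores)
    # BMM2: weighted sum of the value vectors.
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
```

Every attention score passes through one exponential between the two matrix multiplies, so the exponential count grows with context length while the matrix multiplies run on separate hardware.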

In practice

Benchmark `MUFU.EX2` performance using the kernel code in the mufu_ex2_bench repository listed under Code references.
Use Blackwell Ultra for attention-heavy LLM workloads.
Explore NVIDIA's trtllm-gen repository for optimizations.

Topics

NVIDIA Blackwell Ultra
Softmax Optimization
LLM Inference
Attention Mechanism
Special Function Units

Code references

jamieliNVIDIA/mufu_ex2_bench

Best for: AI Engineer, Deep Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.