Controlling Floating-Point Determinism in NVIDIA CCCL

2026-03-05 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Software Development & Engineering, Artificial Intelligence & Machine Learning · Depth: Advanced, short

Summary

NVIDIA CUDA Core Compute Libraries (CCCL) 3.1 introduces a new single-phase API in CUB, a low-level CUDA library, that lets users configure the determinism of reduction algorithms. This addresses the challenge of getting bitwise-consistent results in parallel code: floating-point addition is not associative under finite precision, so the order in which a reduction combines values can change the result bit for bit. The new API offers three determinism levels: `not_guaranteed`, `run_to_run`, and `gpu_to_gpu`. The `not_guaranteed` level provides the highest performance by allowing atomic operations and a single kernel launch, but results may vary between runs. The `run_to_run` level (the default) ensures identical results on the same GPU across multiple invocations by using a fixed hierarchical tree reduction. The `gpu_to_gpu` level offers the strictest reproducibility, with identical results across different GPUs, by employing a Reproducible Floating-point Accumulator (RFA) with three exponent bins, which can increase execution time by 20% to 30% for large problems.
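To make the non-associativity point concrete, here is a minimal host-side C++ sketch. It does not use CUB; the function names `sum_left` and `sum_right` are hypothetical and exist only for illustration. It assumes `float` arithmetic is evaluated in IEEE 754 single precision without fast-math style reordering.

```cpp
// Sum three floats with the grouping (a + b) + c.
float sum_left(float a, float b, float c) {
    float ab = a + b;  // intermediate result is rounded to float here
    return ab + c;
}

// Sum the same three floats with the grouping a + (b + c).
float sum_right(float a, float b, float c) {
    float bc = b + c;  // a small c can be absorbed entirely by a large b
    return a + bc;
}
```

Under those assumptions, `sum_left(1.0e8f, -1.0e8f, 1.0f)` yields 1 while `sum_right(1.0e8f, -1.0e8f, 1.0f)` yields 0, because near 1e8 adjacent `float` values are 8 apart and adding 1 is rounded away.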

Key takeaway

For AI engineers optimizing CUDA-accelerated applications, the practical question is which of CUB's determinism levels a workload actually needs. If your application tolerates minor floating-point variations, `not_guaranteed` can significantly boost performance, especially for smaller input arrays. If strict reproducibility is paramount, particularly across different GPUs, opt for `gpu_to_gpu` and budget for the potential 20% to 30% overhead on large problem sizes. The default, `run_to_run`, sits between the two: results repeat on the same GPU.

Key insights

CUB's new API offers explicit determinism levels for CUDA reductions, balancing performance and reproducibility.

Principles

Floating-point non-associativity impacts parallel reduction determinism.
Higher determinism often correlates with reduced performance.
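
The first principle can be sketched on the host in C++. This is an illustration only, not CUB's kernel: `sequential_sum` and `tree_sum` are hypothetical helpers that add the same values in two different orders. The fixed-shape tree stands in for the general idea that pinning the order of additions pins the result.

```cpp
#include <cstddef>
#include <vector>

// Left-to-right accumulation: one possible order of additions.
float sequential_sum(const std::vector<float>& values) {
    float total = 0.0f;
    for (float v : values) {
        total += v;
    }
    return total;
}

// Pairwise tree reduction with a fixed shape: neighbours are combined
// level by level, so the grouping depends only on the input length.
float tree_sum(std::vector<float> values) {  // taken by value, reduced in place
    if (values.empty()) {
        return 0.0f;
    }
    for (std::size_t width = values.size(); width > 1; width = (width + 1) / 2) {
        const std::size_t half = width / 2;
        for (std::size_t i = 0; i < half; ++i) {
            values[i] = values[2 * i] + values[2 * i + 1];
        }
        if (width % 2 == 1) {
            values[half] = values[width - 1];  // odd element carries to the next level
        }
    }
    return values[0];
}
```

For the input `{1.0e8f, 1.0f, -1.0e8f, 1.0f}`, whose exact sum is 2, `sequential_sum` returns 1 and `tree_sum` returns 0 in IEEE 754 single precision. Neither is exact, but each is repeatable as long as its order of additions never changes, which is the property a fixed reduction tree provides.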

Method

Call `cub::DeviceReduce::Sum` through the new single-phase API and request a determinism level with `cuda::execution::require(cuda::execution::determinism::[level])`, where `[level]` is `not_guaranteed`, `run_to_run`, or `gpu_to_gpu`.

In practice

Use `not_guaranteed` for maximum performance where minor result variations are acceptable.
Employ `run_to_run` for consistent results on a single GPU.
Select `gpu_to_gpu` for identical results across different GPUs.
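
Why can a result be made independent of summation order at all? The host-side C++ sketch below shows one simple way, and it is explicitly not the RFA algorithm that `gpu_to_gpu` uses. It is a fixed-point stand-in: every value is rounded onto a fixed grid and the additions are done on integers, which are associative. The name `fixed_point_sum` and the scale `kScale` are hypothetical choices, and the sketch assumes input magnitudes small enough that the 64-bit accumulator cannot overflow.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical grid resolution: 2^20 steps per unit.
constexpr double kScale = 1048576.0;

// Quantize each value onto the grid, then add the integers.
// Integer addition is associative, so any ordering of the inputs
// produces the same accumulator and therefore the same result.
double fixed_point_sum(const std::vector<float>& values) {
    std::int64_t acc = 0;
    for (float v : values) {
        acc += static_cast<std::int64_t>(std::llround(static_cast<double>(v) * kScale));
    }
    return static_cast<double>(acc) / kScale;
}
```

Summing `{1.0e8f, 1.0f, -1.0e8f, 1.0f}` forwards or backwards with `fixed_point_sum` gives 2 in both cases. The cost is extra work per element and a limited value range, which mirrors the trade-off between reproducibility and performance described above.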

Topics

CUDA Libraries
CUB
Parallel Reduction Algorithms
Floating-Point Determinism
Reproducible Floating-point Accumulator

Code references

NVIDIA/cccl

Best for: AI Engineer, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.