Controlling Floating-Point Determinism in NVIDIA CCCL
Summary
NVIDIA CUDA Core Compute Libraries (CCCL) 3.1 introduces a new single-phase API in CUB, a low-level CUDA library, that lets users configure the determinism of reduction algorithms. Floating-point addition is not associative under finite precision, so a parallel reduction that combines values in a different order can produce a different bitwise result. The new API offers three determinism levels: `not_guaranteed`, `run_to_run`, and `gpu_to_gpu`. `not_guaranteed` provides the highest performance by allowing atomic operations and a single kernel launch, but results may vary between runs. `run_to_run` (the default) ensures identical results across multiple invocations on the same GPU by using a fixed hierarchical tree reduction. `gpu_to_gpu` offers the strictest reproducibility, producing identical results across different GPUs by employing a Reproducible Floating-point Accumulator (RFA) with three exponent bins, which can increase execution time by 20% to 30% for large problems.
Key takeaway
For AI engineers optimizing CUDA-accelerated applications, CUB's new determinism levels make the trade-off between speed and reproducibility explicit. If your application tolerates minor floating-point variations, `not_guaranteed` can significantly boost performance, especially for smaller input arrays. If you only need identical results across runs on the same GPU, the default `run_to_run` already provides that. If results must match across different GPUs, choose `gpu_to_gpu` and budget for a potential 20% to 30% increase in execution time on large problem sizes.
Key insights
CUB's new API offers explicit determinism levels for CUDA reductions, balancing performance and reproducibility.
Principles
- Floating-point non-associativity impacts parallel reduction determinism.
- Higher determinism often correlates with reduced performance.
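The first principle can be seen with three numbers. The sketch below is a host-only C++ illustration, not CUB code; the function names `sum_left_first` and `sum_right_first` are hypothetical, and it assumes IEEE 754 single precision without fast-math compiler flags.

```cpp
// Two groupings of the same three-term float sum.
// Assumes IEEE 754 binary32 arithmetic and no -ffast-math style reassociation.

// Computes (a + b) + c.
float sum_left_first(float a, float b, float c) {
    const float partial = a + b;
    return partial + c;
}

// Computes a + (b + c).
float sum_right_first(float a, float b, float c) {
    const float partial = b + c;
    return a + partial;
}
```

With `a = 1.0e8f`, `b = -1.0e8f`, and `c = 1.0f`, the first grouping cancels the large terms before adding `c` and returns 1. In the second grouping, `b + c` rounds back to `-1.0e8f` because adjacent floats near 1e8 are 8 apart, so the result is 0. A parallel reduction that changes grouping between runs can therefore change the result.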
Method
Set the determinism level of `cub::DeviceReduce::Sum` by passing `cuda::execution::require(cuda::execution::determinism::[level])` through the new single-phase API, where `[level]` is `not_guaranteed`, `run_to_run`, or `gpu_to_gpu`.
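A minimal sketch of such a call might look like the following. It is CUDA C++ and needs `nvcc` with CCCL 3.1 or later. It assumes that `cub/device/device_reduce.cuh` makes the `cuda::execution` names visible and that the environment object is passed as the last argument; treat both as assumptions to check against the CUB documentation for your version.

```cpp
#include <cub/device/device_reduce.cuh>
#include <thrust/device_vector.h>

#include <cstdio>

int main() {
    // Four inputs whose sum (6) is exact in float regardless of summation order.
    thrust::device_vector<float> input{0.0f, 1.0f, 2.0f, 3.0f};
    thrust::device_vector<float> output(1);

    // Request a determinism level. The other levels named in the text,
    // run_to_run (the default) and gpu_to_gpu, would be swapped in here.
    auto env = cuda::execution::require(cuda::execution::determinism::not_guaranteed);

    // Single-phase call: no separate temporary-storage size query.
    // Passing env as the final argument is an assumption of this sketch.
    cudaError_t status =
        cub::DeviceReduce::Sum(input.begin(), output.begin(), input.size(), env);
    if (status != cudaSuccess) {
        std::fprintf(stderr, "DeviceReduce::Sum failed: %s\n", cudaGetErrorString(status));
        return 1;
    }

    const float result = output[0];  // copies the single result back to the host
    std::printf("sum = %.1f\n", result);
    return 0;
}
```

Changing only the level inside `require(...)` is the intended way to move between the three trade-offs; the rest of the call stays the same.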
In practice
- Use `not_guaranteed` for maximum performance where minor result variations are acceptable.
- Keep the default `run_to_run` for identical results across runs on the same GPU.
- Select `gpu_to_gpu` for identical results across different GPUs.
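To see why fixing the reduction order is enough for repeatable results on one machine, consider this host-only C++ sketch. It is an illustration of the idea, not CUB's implementation; `sequential_sum` and `tree_sum` are hypothetical names, and IEEE 754 single precision without fast-math flags is assumed.

```cpp
#include <cstddef>
#include <vector>

// Left-to-right accumulation: one particular summation order.
float sequential_sum(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) {
        acc += x;
    }
    return acc;
}

// Pairwise (tree) reduction over the half-open index range [lo, hi).
// The split point depends only on lo and hi, so the tree shape, and with it
// the rounding at every node, is the same on every call for the same input.
float tree_sum(const std::vector<float>& v, std::size_t lo, std::size_t hi) {
    if (hi == lo) {
        return 0.0f;
    }
    if (hi - lo == 1) {
        return v[lo];
    }
    const std::size_t mid = lo + (hi - lo) / 2;
    const float left = tree_sum(v, lo, mid);
    const float right = tree_sum(v, mid, hi);
    return left + right;
}
```

For the input `{1.0e8f, 1.0f, -1.0e8f, 1.0f}`, whose exact sum is 2, `sequential_sum` returns 1 and `tree_sum(v, 0, v.size())` returns 0. The two orders disagree with each other, but each is bitwise repeatable because its order never changes. The `not_guaranteed` level gives up exactly that fixed order in exchange for speed.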
Topics
- CUDA Libraries
- CUB
- Parallel Reduction Algorithms
- Floating-Point Determinism
- Reproducible Floating-point Accumulator
Best for: AI Engineer, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.