CUDA 13.2 Introduces Enhanced CUDA Tile Support and New Python Features

2026-03-09 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Software Development & Engineering, Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, long

Summary

NVIDIA has released CUDA 13.2, with updates aimed at developer productivity in both Python and C++. NVIDIA CUDA Tile support now covers compute capability 8.X, 10.X, and 12.X architectures (Ampere, Ada, Blackwell), and cuTile Python gains new language features such as recursive functions and custom reductions.

The release introduces simplified `memcpy` APIs with attributes, a reduced per-context local memory footprint on Windows, and new APIs to query memory pool properties. Windows compute drivers now default to MCDM instead of TCC for improved compatibility and advanced memory management. CUDA 13.2 also adds `CUDA_DISABLE_PERF_BOOST` for power savings, a polymorphic `cudaGraphNodeGetParams` function, and compiler updates including Visual Studio 2026 support. Embedded devices benefit from unified CUDA for Arm and Multi-Instance GPU (MIG) support on Jetson Thor for workload isolation.

Among the math libraries, cuBLAS adds MXFP8 Grouped GEMM on Blackwell and cuSOLVER introduces FP64-emulated calculations for speedups of up to 2x. Developer tools gain NVIDIA Nsight Python for kernel profiling, Numba-CUDA debugging, and Nsight Compute 2026.1 enhancements. CCCL 3.2 provides modern C++ interfaces for core CUDA and new algorithms such as `cub::DeviceTopK` (up to 5x faster) and an optimized `cub::DeviceSegmentedReduce` (up to 66x faster).

Key takeaway

For NLP and computer vision engineers optimizing GPU workloads, CUDA 13.2 offers both performance and productivity gains. Explore the new cuTile Python language features (recursive functions, custom reductions) and consider adopting the modern C++ interfaces in CCCL 3.2 for safer, more efficient GPU programming. Where a workload relies on top-k selection or segmented reduction, `cub::DeviceTopK` and `cub::DeviceSegmentedReduce` are reported to be up to 5x and up to 66x faster respectively, while Nsight Python and Numba-CUDA debugging streamline development and optimization workflows.

Key insights

CUDA 13.2 unifies and optimizes GPU programming across Python and C++ with new features for performance, debugging, and embedded systems.

Principles

Prioritize developer productivity through simplified APIs.
Enhance performance via specialized algorithms and emulation.
Ensure compatibility and isolation for diverse computing environments.

Method

CUDA 13.2 introduces `cudaMemcpyWithAttributesAsync` for simplified memory transfers, `cudaMemPoolGetAttribute` for querying memory pool properties, and a polymorphic `cudaGraphNodeGetParams` for retrieving CUDA Graph node parameters. On the library side, CCCL 3.2 adds the new `cub::DeviceTopK` algorithm and an optimized `cub::DeviceSegmentedReduce`.
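To make concrete what the two CUB algorithms named above compute, here is a minimal CPU reference in plain Python. It is not the CUB API and makes no performance claim. The function names `top_k` and `segmented_sum` and the single `offsets` list are assumptions chosen for this sketch. Top-k selection returns the k largest items, and a segmented reduction reduces each contiguous segment of a flat array independently.

```python
import heapq


def top_k(values, k):
    """Return the k largest values in descending order (CPU reference only)."""
    return heapq.nlargest(k, values)


def segmented_sum(values, offsets):
    """Sum each segment values[offsets[i]:offsets[i + 1]].

    `offsets` holds one more entry than there are segments. Two equal
    consecutive offsets describe an empty segment, which sums to 0.
    """
    return [
        sum(values[begin:end])
        for begin, end in zip(offsets, offsets[1:])
    ]


# Hypothetical data: six scores, and six integers split into three segments.
scores = [0.1, 0.9, 0.4, 0.7, 0.3, 0.8]
print(top_k(scores, 3))                                  # [0.9, 0.8, 0.7]
print(segmented_sum([1, 2, 3, 4, 5, 6], [0, 2, 2, 6]))   # [3, 0, 18]
```

A reference like this can serve as a small correctness oracle when validating GPU results on test inputs before relying on the device-side implementations.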

In practice

Install cuTile Python with `pip install cuda-tile[tileiras]`.
Set `CUDA_DISABLE_PERF_BOOST=1` for power savings.
Use NVIDIA Nsight Python for kernel performance analysis.
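One way to apply the environment-variable tip from Python is to set it on a child process rather than on the whole shell. The sketch below is a minimal illustration: the helper name `env_with_perf_boost_disabled` is hypothetical, and the child command is a placeholder that only echoes the variable instead of running a real CUDA workload. Setting the variable before the process starts is a conservative assumption here, since the summary does not say when the driver reads it.

```python
import os
import subprocess
import sys


def env_with_perf_boost_disabled(base_env=None):
    """Return a copy of the environment with CUDA_DISABLE_PERF_BOOST=1.

    The caller's mapping (or os.environ) is copied, never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_DISABLE_PERF_BOOST"] = "1"
    return env


if __name__ == "__main__":
    # Placeholder child: replace with the actual CUDA application command.
    child = [
        sys.executable,
        "-c",
        "import os; print(os.environ['CUDA_DISABLE_PERF_BOOST'])",
    ]
    result = subprocess.run(
        child,
        env=env_with_perf_boost_disabled(),
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout.strip())
```

Scoping the variable to one process keeps the power-saving setting away from other GPU jobs launched from the same session.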

Topics

CUDA Toolkit
GPU Programming
Python for CUDA
Performance Optimization
Developer Tools

Best for: NLP Engineer, Computer Vision Engineer, AI Engineer, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.