Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs
Summary
NVIDIA's CUDA Tile (CuTile) is a Python-based, tile-centric abstraction for GPU kernel development, aiming to simplify programming while maintaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency. An independent, cross-architecture evaluation compared CuTile against cuBLAS, Triton, WMMA, and raw SIMT on NVIDIA H100 NVL (Hopper), B200, and RTX PRO 6000 Blackwell Server Edition GPUs. Results show CuTile's effectiveness varies significantly by workload and architecture. On datacenter Blackwell (B200), CuTile achieved up to 1,007 TFLOP/s for fused attention, outperforming FlashAttention-2 by 2.5x with only 60 lines of Python code. For GEMM, CuTile reached 52–79% of cuBLAS performance in 22 lines, making it a practical replacement for hand-written WMMA kernels. However, on RTX PRO 6000 (sm_120), the same CuTile attention kernel achieved only 53% of FlashAttention-2 throughput, exposing significant cross-architecture optimization gaps. Triton demonstrated stronger portability, sustaining 62–101% of cuBLAS performance across all platforms without architecture-specific tuning.
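The TFLOP/s and "percent of cuBLAS" figures above are throughput ratios. A minimal sketch of how such figures are typically derived, assuming the conventional 2·M·N·K floating-point operation count for a GEMM. The summary does not state the evaluation's exact counting convention, and the function names and the timing value below are hypothetical, not measurements from the article.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of an (m x k) @ (k x n) GEMM in TFLOP/s.

    Assumes the usual count of 2*m*n*k operations
    (one multiply and one add per inner-product term).
    """
    flops = 2 * m * n * k
    return flops / seconds / 1e12


def fraction_of_baseline(candidate_tflops: float, baseline_tflops: float) -> float:
    """Candidate throughput as a fraction of a baseline such as cuBLAS."""
    return candidate_tflops / baseline_tflops


# Hypothetical timing, for illustration only.
example_tflops = gemm_tflops(4096, 4096, 4096, seconds=1e-3)
```

A kernel reported at "79% of cuBLAS" is one for which `fraction_of_baseline` returns 0.79 when both kernels are timed on the same problem shape and data type.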
Key takeaway
For AI engineers and ML architects developing GPU kernels: if you are targeting datacenter Blackwell GPUs such as the B200 (the part measured in the evaluation) for fused attention, switch to CuTile, which delivered 2.5x the throughput of FlashAttention-2 in about 60 lines of Python. If you need portability across Hopper and Blackwell, or consistent performance on workstation Blackwell GPUs (sm_120), prioritize Triton or established libraries such as cuBLAS, because CuTile's performance is highly architecture-dependent and currently immature on sm_120.
Key insights
CuTile offers high productivity and peak performance on datacenter Blackwell, but that performance does not yet carry over consistently to other architectures, notably workstation Blackwell (sm_120).
Principles
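- Compiler maturity impacts performance: the same CuTile attention kernel beat FlashAttention-2 by 2.5x on B200 but reached only 53% of its throughput on sm_120.
- Portability often trades off with peak performance: Triton sustained 62–101% of cuBLAS on every platform without architecture-specific tuning, while CuTile's peak result came on B200.
- Short tile-level Python can replace complex hand-written CUDA: 22 lines for GEMM at 52–79% of cuBLAS, and 60 lines for fused attention.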
Method
CuTile abstracts GPU kernel development behind the tile-level primitives ct.load, ct.mma, and ct.store, which give Python code access to Tensor Cores and TMA. Architecture-specific tuning is done through @ct.kernel and ByTarget.
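The load, multiply-accumulate, store pattern can be illustrated without a GPU. The sketch below is plain NumPy, not CuTile: load_tile, mma, and store_tile are hypothetical stand-ins that only mimic the roles of ct.load, ct.mma, and ct.store, and the tile sizes are arbitrary. It shows the shape of a tile-centric GEMM, not the real API or its performance.

```python
import numpy as np

# Arbitrary tile sizes chosen for illustration.
TILE_M, TILE_N, TILE_K = 32, 32, 16


def load_tile(array, row_block, col_block, shape):
    """Stand-in for a tile load: return one (rows x cols) tile of `array`."""
    rows, cols = shape
    return array[row_block * rows:(row_block + 1) * rows,
                 col_block * cols:(col_block + 1) * cols]


def mma(a_tile, b_tile, acc):
    """Stand-in for a tile multiply-accumulate: acc + a_tile @ b_tile in float32."""
    return acc + a_tile.astype(np.float32) @ b_tile.astype(np.float32)


def store_tile(array, row_block, col_block, tile):
    """Stand-in for a tile store: write `tile` back into `array`."""
    rows, cols = tile.shape
    array[row_block * rows:(row_block + 1) * rows,
          col_block * cols:(col_block + 1) * cols] = tile


def tiled_gemm(a, b):
    """Compute a @ b one output tile at a time."""
    m, k = a.shape
    k_b, n = b.shape
    assert k == k_b, "inner dimensions must match"
    assert m % TILE_M == 0 and n % TILE_N == 0 and k % TILE_K == 0
    c = np.zeros((m, n), dtype=np.float32)
    # On a GPU, each (bm, bn) pair would map to one thread block.
    for bm in range(m // TILE_M):
        for bn in range(n // TILE_N):
            acc = np.zeros((TILE_M, TILE_N), dtype=np.float32)
            for bk in range(k // TILE_K):
                a_tile = load_tile(a, bm, bk, (TILE_M, TILE_K))
                b_tile = load_tile(b, bk, bn, (TILE_K, TILE_N))
                acc = mma(a_tile, b_tile, acc)
            store_tile(c, bm, bn, acc)
    return c
```

The point of the tile abstraction is that the programmer writes only this block-level loop; mapping tiles onto Tensor Core instructions and TMA transfers is left to the compiler.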
In practice
- Use CuTile for fused attention on B200.
- Replace WMMA kernels with CuTile on Blackwell.
- Consider Triton for cross-architecture needs.
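The guidance above can be written down as a small dispatch rule. This is a sketch only: the function name and return labels are hypothetical, and the fallback for architectures the summary gives no attention results for (such as Hopper) is an assumption that follows the portability advice. The compute capabilities used are (10, 0) for B200 and (12, 0) for sm_120 parts such as the RTX PRO 6000 Blackwell.

```python
def choose_attention_backend(compute_capability, need_portability=False):
    """Pick a fused-attention backend from a (major, minor) compute capability.

    Returns a hypothetical label, not an importable module name.
    """
    major, minor = compute_capability
    if need_portability:
        # One kernel across Hopper and Blackwell: prefer the portable option.
        return "triton-or-library"
    if (major, minor) == (10, 0):
        # Datacenter Blackwell (B200): where CuTile attention performed best.
        return "cutile"
    # sm_120 and architectures without reported results (assumption):
    # fall back to Triton or an established library.
    return "triton-or-library"
```

For example, `choose_attention_backend((10, 0))` selects CuTile, while the same call with `need_portability=True` or with `(12, 0)` falls back to Triton or an established library.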
Topics
- CUDA Tile
- GPU Kernel Development
- Blackwell Architecture
- Hopper Architecture
- LLM Inference Optimization
- Tensor Cores
- Triton
Best for: NLP Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.