Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

NVIDIA's CUDA Tile (CuTile) is a Python-based, tile-centric abstraction for GPU kernel development, aiming to simplify programming while maintaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency. An independent, cross-architecture evaluation compared CuTile against cuBLAS, Triton, WMMA, and raw SIMT on NVIDIA H100 NVL (Hopper), B200, and RTX PRO 6000 Blackwell Server Edition GPUs. Results show CuTile's effectiveness varies significantly by workload and architecture. On datacenter Blackwell (B200), CuTile achieved up to 1,007 TFLOP/s for fused attention, outperforming FlashAttention-2 by 2.5x with only 60 lines of Python code. For GEMM, CuTile reached 52–79% of cuBLAS performance in 22 lines, making it a practical replacement for hand-written WMMA kernels. However, on RTX PRO 6000 (sm_120), the same CuTile attention kernel achieved only 53% of FlashAttention-2 throughput, exposing significant cross-architecture optimization gaps. Triton demonstrated stronger portability, sustaining 62–101% of cuBLAS performance across all platforms without architecture-specific tuning.

Key takeaway

For AI engineers and ML architects who write GPU kernels: if you are targeting NVIDIA B200 datacenter GPUs for fused attention, CuTile is worth evaluating, since in this study it delivered up to 2.5x the throughput of FlashAttention-2 in about 60 lines of Python. If you need portability across Hopper and Blackwell, or consistent performance on workstation Blackwell GPUs (sm_120), prioritize Triton or established libraries such as cuBLAS. CuTile's performance is highly architecture-dependent, and on sm_120 the same attention kernel reached only 53% of FlashAttention-2 throughput.

Key insights

CuTile offers high productivity and peak performance on datacenter Blackwell (B200), but that performance does not yet carry over consistently to other architectures, most visibly workstation Blackwell (sm_120).

Principles

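Compiler maturity for a given architecture determines how much of the hardware a high-level kernel can actually use.
Portability often trades off against peak performance.
Concise tile-level Python can replace complex hand-written CUDA such as WMMA kernels.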

Method

CuTile expresses a GPU kernel as operations on tiles rather than on individual threads. The ct.load, ct.mma, and ct.store primitives give Python code access to Tensor Cores and the TMA, and architecture-specific tuning is applied through @ct.kernel and ByTarget.
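
To make the load, multiply-accumulate, store pattern concrete, here is a minimal CPU-only sketch in plain Python. It does not use the CuTile API and runs no GPU code. The helpers load_tile, mma_tile, and store_tile, and the tile sizes, are hypothetical stand-ins chosen for illustration; they only mirror the roles that ct.load, ct.mma, and ct.store play in the description above.

```python
# Hypothetical tile sizes, kept tiny so the data flow is easy to follow.
TILE_M, TILE_N, TILE_K = 2, 2, 2

def load_tile(mat, tile_row, tile_col, rows, cols):
    """Copy a rows x cols tile out of a list-of-lists matrix (stand-in for a tile load)."""
    r0, c0 = tile_row * rows, tile_col * cols
    return [mat[r][c0:c0 + cols] for r in range(r0, r0 + rows)]

def mma_tile(a_tile, b_tile, acc):
    """Return acc + a_tile @ b_tile (stand-in for a tile-level multiply-accumulate)."""
    inner = len(b_tile)
    return [
        [acc[i][j] + sum(a_tile[i][p] * b_tile[p][j] for p in range(inner))
         for j in range(len(acc[0]))]
        for i in range(len(acc))
    ]

def store_tile(mat, tile_row, tile_col, tile):
    """Write a tile back into the output matrix (stand-in for a tile store)."""
    r0, c0 = tile_row * len(tile), tile_col * len(tile[0])
    for i, row in enumerate(tile):
        mat[r0 + i][c0:c0 + len(row)] = row

def tiled_gemm(a, b):
    """Compute a @ b one output tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    assert len(a[0]) == k, "inner dimensions must match"
    assert m % TILE_M == 0 and n % TILE_N == 0 and k % TILE_K == 0
    c = [[0.0] * n for _ in range(m)]
    # On a GPU each (i, j) output tile would map to one block; here it is a loop.
    for i in range(m // TILE_M):
        for j in range(n // TILE_N):
            acc = [[0.0] * TILE_N for _ in range(TILE_M)]
            for p in range(k // TILE_K):
                a_tile = load_tile(a, i, p, TILE_M, TILE_K)
                b_tile = load_tile(b, p, j, TILE_K, TILE_N)
                acc = mma_tile(a_tile, b_tile, acc)
            store_tile(c, i, j, acc)
    return c
```

The point of the sketch is the shape of the loop: each output tile owns an accumulator, walks the shared K dimension tile by tile, and writes its result once. A tile-centric kernel language lets the programmer write roughly this structure and leaves thread-level scheduling to the compiler.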

In practice

Use CuTile for fused attention on B200.
Replace WMMA kernels with CuTile on Blackwell.
Consider Triton for cross-architecture needs.
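
For readers less familiar with what "fused" attention means, the sketch below contrasts an unfused baseline with a tile-by-tile version that keeps a running (online) softmax, the scheme popularized by FlashAttention. It is plain Python for a single query vector, not CuTile code, and it is an assumption that the kernels evaluated in the study follow the same scheme. The function names and the tile parameter are hypothetical.

```python
import math

def attention_reference(q, keys, values):
    """Unfused baseline: materialize every score, apply softmax, then weight the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def attention_fused(q, keys, values, tile=2):
    """Visit keys and values one tile at a time, never holding the full score vector."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    out = [0.0] * dim         # unnormalized output, relative to running_max
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        out = [o * correction + sum(w * v[d] for w, v in zip(weights, v_tile))
               for d, o in enumerate(out)]
        running_max = new_max
    return [o / running_sum for o in out]
```

Both functions return the same result up to floating-point rounding. The fused version only ever holds one tile of scores at a time, which is what allows a GPU kernel to keep the intermediate data in fast on-chip memory instead of writing the full score matrix out to device memory.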

Topics

CUDA Tile
GPU Kernel Development
Blackwell Architecture
Hopper Architecture
LLM Inference Optimization
Tensor Cores
Triton

Best for: NLP Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.