AMD GPU Programming From Beginner to Expert (Part 1) - TensorDescriptor in Composable Kernel (CK)
Summary
This article, part of a series on AMD GPU programming, introduces TensorDescriptor, the Composable Kernel (CK) framework's fundamental abstraction for managing multi-dimensional data layouts and transformations. It explains how TensorDescriptor uses a tree of "Transforms" (such as Embed, UnMerge, Merge, and PassThrough) to map logical coordinates to physical memory addresses. The article walks through building a 3D tensor from a 2D base with these transforms and includes a C++ example that instantiates and chains them. It also presents a complete, optimized matrix transpose kernel for AMD GPUs, covering host code, kernel logic, and performance: the kernel runs in 5.820 μs versus 8.4 μs for PyTorch, a 44.3% throughput improvement (8.4 / 5.820 ≈ 1.443).
Key takeaway
For AI Engineers and Machine Learning Engineers optimizing GPU kernel performance on AMD hardware, understanding CK's TensorDescriptor and its composable Transforms is crucial. Explore building custom data layouts by chaining `UnMerge`, `Merge`, and `PassThrough` transforms, and adopt the demonstrated pattern of processing a 4x4 sub-matrix per thread in registers for operations like matrix transpose; in the article this yields a 44.3% throughput gain over PyTorch.
Key insights
Composable Kernel's TensorDescriptor uses hierarchical transforms to efficiently manage complex multi-dimensional data layouts on AMD GPUs.
Principles
- Logical coordinates map to physical memory via stride vectors.
- Transforms are composable operations for coordinate space mapping.
- Efficient GPU kernels use vectorized access and register-level computation.
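The first principle above can be illustrated with a small sketch: a linear offset is the dot product of a coordinate vector and a stride vector. The helper name `linear_offset` and the 4x8 example shape are assumptions made for illustration; they are not part of CK.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a CK API): offset = sum over i of coord[i] * strides[i].
template <std::size_t N>
constexpr std::size_t linear_offset(const std::array<std::size_t, N>& coord,
                                    const std::array<std::size_t, N>& strides)
{
    std::size_t offset = 0;
    for (std::size_t i = 0; i < N; ++i)
        offset += coord[i] * strides[i];
    return offset;
}

// A 4x8 matrix stored row-major has strides {8, 1}: element (2, 3) is at 2*8 + 3.
static_assert(linear_offset<2>({2, 3}, {8, 1}) == 19, "row-major offset");

// The same shape stored column-major has strides {1, 4}: element (2, 3) is at 2 + 3*4.
static_assert(linear_offset<2>({2, 3}, {1, 4}) == 14, "column-major offset");
```

Only the stride vector changes between the two layouts; the logical coordinate (2, 3) stays the same.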
Method
TensorDescriptor defines tensors using a tree of Transforms, each with a `CalculateLowerIndex` method, to map upper-level coordinates to lower-level ones, ultimately resolving to a linear memory offset.
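As a rough illustration of this method, the sketch below chains simplified stand-ins for the transforms to view a row-major 6x4 tensor as a 2x3x4 tensor. Everything in the `sketch` namespace is hypothetical: the structs borrow the transform names and the `CalculateLowerIndex` method name from the article, but the signatures are simplified and do not match CK's template-based API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

namespace sketch {  // simplified stand-ins, not the real CK types

using Index = std::size_t;

// Splits one lower dimension into two upper ones: (m1, m2) -> m = m1 * M2 + m2.
struct UnMerge2 {
    Index len1, len2;  // upper lengths M1 and M2
    constexpr Index CalculateLowerIndex(Index m1, Index m2) const { return m1 * len2 + m2; }
};

// Forwards the upper index unchanged.
struct PassThrough {
    constexpr Index CalculateLowerIndex(Index i) const { return i; }
};

// Maps a 2D coordinate to a linear memory offset through strides.
struct Embed2 {
    Index stride0, stride1;
    constexpr Index CalculateLowerIndex(Index i0, Index i1) const
    {
        return i0 * stride0 + i1 * stride1;
    }
};

// Fuses two lower dimensions into one upper one: i -> (i / len1, i % len1).
struct Merge2 {
    Index len0, len1;  // lower lengths
    constexpr std::array<Index, 2> CalculateLowerIndex(Index i) const
    {
        return {i / len1, i % len1};
    }
};

// A two-level chain: (m1, m2, n) -> (m, n) -> offset.
struct View3D {
    UnMerge2 split_m;
    PassThrough keep_n;
    Embed2 base;
    constexpr Index Offset(Index m1, Index m2, Index n) const
    {
        return base.CalculateLowerIndex(split_m.CalculateLowerIndex(m1, m2),
                                        keep_n.CalculateLowerIndex(n));
    }
};

// Row-major 6x4 base (strides {4, 1}) viewed as 2x3x4.
constexpr View3D make_example_view()
{
    return View3D{UnMerge2{2, 3}, PassThrough{}, Embed2{4, 1}};
}

// (m1, m2, n) = (1, 2, 3) -> m = 1*3 + 2 = 5 -> offset = 5*4 + 3 = 23.
static_assert(make_example_view().Offset(1, 2, 3) == 23, "chained lowering");

}  // namespace sketch
```

Each struct only knows how to lower its own coordinates, so new layouts come from rearranging the chain rather than rewriting the index arithmetic.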
In practice
- Use `UnMerge` to split dimensions (e.g., M into M1, M2).
- Chain `Transforms` to create complex tensor layouts.
- Implement 4x4 sub-matrix processing per thread for transpose.
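The 4x4 per-thread pattern in the last item can be emulated on the CPU to show the indexing. The sketch below is not the article's GPU kernel: the function names, the local `reg` array standing in for registers, and the requirement that both dimensions be multiples of 4 are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 4;

// Work of one hypothetical "thread": transpose the 4x4 tile at (tile_row, tile_col).
// `in` is rows x cols and `out` is cols x rows, both row-major.
inline void transpose_tile(const float* in, float* out,
                           std::size_t rows, std::size_t cols,
                           std::size_t tile_row, std::size_t tile_col)
{
    float reg[kTile][kTile];  // stands in for per-thread registers
    const std::size_t r0 = tile_row * kTile;
    const std::size_t c0 = tile_col * kTile;

    // Load four rows of four contiguous elements (a 4-wide vector load on a GPU).
    for (std::size_t i = 0; i < kTile; ++i)
        for (std::size_t j = 0; j < kTile; ++j)
            reg[i][j] = in[(r0 + i) * cols + (c0 + j)];

    // Store the transposed tile; each output row is again four contiguous elements.
    for (std::size_t j = 0; j < kTile; ++j)
        for (std::size_t i = 0; i < kTile; ++i)
            out[(c0 + j) * rows + (r0 + i)] = reg[i][j];
}

// Serial loop over tiles; on a GPU each iteration would map to one thread.
inline std::vector<float> transpose_4x4_tiles(const std::vector<float>& in,
                                              std::size_t rows, std::size_t cols)
{
    assert(rows % kTile == 0 && cols % kTile == 0);
    assert(in.size() == rows * cols);
    std::vector<float> out(rows * cols);
    for (std::size_t tr = 0; tr < rows / kTile; ++tr)
        for (std::size_t tc = 0; tc < cols / kTile; ++tc)
            transpose_tile(in.data(), out.data(), rows, cols, tr, tc);
    return out;
}
```

Holding the whole tile in local storage lets both the loads and the stores touch contiguous runs of four elements, which is what makes vectorized access possible on either side of the transpose.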
Topics
- AMD GPU Programming
- Composable Kernel
- TensorDescriptor
- GPU Kernel Optimization
- Matrix Transpose
Best for: AI Engineer, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.