AMD GPU Programming From Beginner to Expert (Part 1) - TensorDescriptor in Composable Kernel (CK)

2026-03-25 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, long

Summary

This article, the first in a series on AMD GPU programming, introduces TensorDescriptor, the Composable Kernel (CK) abstraction for describing multi-dimensional data layouts and the transformations between them. It explains how a TensorDescriptor uses a tree of "Transforms" (such as Embed, UnMerge, Merge, and PassThrough) to map logical coordinates to physical memory addresses. The article works through building a 3D tensor from a 2D base with these transforms and includes a C++ example that instantiates and chains them. It then presents a complete, optimized matrix transpose kernel for AMD GPUs, covering the host code, the kernel logic, and measured performance: 5.820 μs versus 8.4 μs for PyTorch. That is a 44.3% throughput improvement (8.4 / 5.820 ≈ 1.443), or equivalently about 30.7% less time per run.

Key takeaway

For AI Engineers and Machine Learning Engineers who optimize GPU kernels on AMD hardware, CK's TensorDescriptor and its composable Transforms are worth understanding in detail. Try building custom data layouts by chaining `UnMerge`, `Merge`, and `PassThrough` transforms, and consider the article's pattern of having each thread process a 4x4 sub-matrix in registers for operations such as matrix transpose. In the article's measurements, that pattern delivered 44.3% higher throughput than PyTorch.

Key insights

Composable Kernel's TensorDescriptor uses hierarchical transforms to efficiently manage complex multi-dimensional data layouts on AMD GPUs.

Principles

Logical coordinates map to physical memory via stride vectors.
Transforms are composable operations for coordinate space mapping.
Efficient GPU kernels use vectorized access and register-level computation.
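
The first principle can be illustrated with a few lines of plain C++. This is a minimal sketch, not CK code: `linear_offset` is a hypothetical helper introduced here, and it assumes the offset is counted in elements rather than bytes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper for illustration (not part of CK):
// offset = index[0] * strides[0] + ... + index[N-1] * strides[N-1]
template <std::size_t N>
std::size_t linear_offset(const std::array<std::size_t, N>& index,
                          const std::array<std::size_t, N>& lengths,
                          const std::array<std::size_t, N>& strides)
{
    std::size_t offset = 0;
    for (std::size_t d = 0; d < N; ++d)
    {
        assert(index[d] < lengths[d]); // the coordinate must lie inside the tensor
        offset += index[d] * strides[d];
    }
    return offset;
}
```

For a row-major 4x8 matrix (lengths {4, 8}, strides {8, 1}), coordinate (2, 3) resolves to 2*8 + 3*1 = 19. Viewing the same storage as its transpose only swaps the lengths and strides: with lengths {8, 4} and strides {1, 8}, coordinate (3, 2) resolves to 3*1 + 2*8 = 19, the same element, with no data movement.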

Method

A TensorDescriptor defines a tensor as a tree of Transforms. Each Transform provides a `CalculateLowerIndex` method that maps upper-level coordinates to lower-level ones; applying the transforms from the top of the tree down resolves a logical coordinate to a linear memory offset.
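
To make the idea concrete, here is a simplified runtime sketch of a 3D view (M1, M2, N) built on a row-major 2D base (M, N). The method name `CalculateLowerIndex` follows the article, but every struct, signature, and the `make_3d_view` helper below are hypothetical stand-ins written for this summary; CK's real transforms are compile-time templates with different interfaces.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Index = std::size_t;

// All types below are illustrative stand-ins, not CK's classes.

// PassThrough: the upper coordinate is the lower coordinate.
struct PassThrough
{
    Index CalculateLowerIndex(Index upper) const { return upper; }
};

// UnMerge2: one lower dimension is seen as two upper dimensions (outer, inner).
struct UnMerge2
{
    Index inner_length;
    Index CalculateLowerIndex(Index outer, Index inner) const
    {
        assert(inner < inner_length);
        return outer * inner_length + inner;
    }
};

// Merge2: two lower dimensions are seen as one upper dimension.
struct Merge2
{
    Index inner_length;
    std::array<Index, 2> CalculateLowerIndex(Index upper) const
    {
        return {upper / inner_length, upper % inner_length};
    }
};

// Embed2: maps a 2D coordinate to a linear offset through strides.
struct Embed2
{
    Index stride0;
    Index stride1;
    Index CalculateLowerIndex(Index i0, Index i1) const
    {
        return i0 * stride0 + i1 * stride1;
    }
};

// A 3D view (M1, M2, N) over a 2D base (M, N): M is split, N passes through.
struct Tensor3DView
{
    UnMerge2 split_m;
    PassThrough keep_n;
    Embed2 base;

    Index offset(Index m1, Index m2, Index n) const
    {
        const Index m     = split_m.CalculateLowerIndex(m1, m2); // (m1, m2) -> m
        const Index n_low = keep_n.CalculateLowerIndex(n);       // n -> n
        return base.CalculateLowerIndex(m, n_low);               // (m, n) -> offset
    }
};

// Builds the view for a row-major base with row length n_length.
inline Tensor3DView make_3d_view(Index m2_length, Index n_length)
{
    return Tensor3DView{UnMerge2{m2_length}, PassThrough{}, Embed2{n_length, 1}};
}
```

With M = 6 split into M1 = 2 and M2 = 3, and N = 4, `make_3d_view(3, 4).offset(1, 2, 3)` first computes m = 1*3 + 2 = 5 and then 5*4 + 3 = 23. `Merge2{4}.CalculateLowerIndex(23)` goes the other way and recovers (5, 3).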

In practice

Use `UnMerge` to split dimensions (e.g., M into M1, M2).
Chain Transforms to create complex tensor layouts.
Have each thread process a 4x4 sub-matrix when implementing transpose.
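
The per-thread 4x4 pattern can be sketched on the CPU to show the indexing involved. This is not the article's kernel: `transpose_tile` and `transpose` are hypothetical names, a sequential loop stands in for the GPU thread grid, and the sketch assumes row-major storage with both dimensions divisible by 4.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 4; // each "thread" owns one 4x4 sub-matrix

// Work of one hypothetical thread: transpose tile (tile_row, tile_col)
// of a row-major rows x cols input into a row-major cols x rows output.
void transpose_tile(const std::vector<float>& in, std::vector<float>& out,
                    std::size_t rows, std::size_t cols,
                    std::size_t tile_row, std::size_t tile_col)
{
    float reg[kTile][kTile]; // stands in for per-thread registers

    // Load: 4 rows of 4 contiguous elements. On a GPU, each row could be
    // fetched with a single 4-wide vector load.
    for (std::size_t i = 0; i < kTile; ++i)
        for (std::size_t j = 0; j < kTile; ++j)
            reg[i][j] = in[(tile_row * kTile + i) * cols + tile_col * kTile + j];

    // Store: write columns of the tile as contiguous rows of the output.
    for (std::size_t j = 0; j < kTile; ++j)
        for (std::size_t i = 0; i < kTile; ++i)
            out[(tile_col * kTile + j) * rows + tile_row * kTile + i] = reg[i][j];
}

// Sequential driver: the two loops play the role of the thread grid.
std::vector<float> transpose(const std::vector<float>& in,
                             std::size_t rows, std::size_t cols)
{
    assert(rows % kTile == 0 && cols % kTile == 0);
    assert(in.size() == rows * cols);

    std::vector<float> out(rows * cols);
    for (std::size_t tr = 0; tr < rows / kTile; ++tr)
        for (std::size_t tc = 0; tc < cols / kTile; ++tc)
            transpose_tile(in, out, rows, cols, tr, tc);
    return out;
}
```

Because each tile is held in a local array between the load and the store, both the reads and the writes touch 4 contiguous elements at a time, which is what makes vectorized access possible on both sides of the transpose.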

Topics

AMD GPU Programming
Composable Kernel
TensorDescriptor
GPU Kernel Optimization
Matrix Transpose

Best for: AI Engineer, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.