FlyDSL: Expert GPU Kernel Development with the Ease of MLIR Python Native DSL on AMD GPUs

2026-02-20 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, medium

Summary

AMD has introduced FlyDSL (Flexible Layout Python DSL), a Python-first, MLIR-native DSL for expert-level GPU kernel development on AMD architectures. Released on February 20, 2026, FlyDSL is an open framework for authoring high-performance GPU kernels with explicit layouts and tiling. It is powered by FLIR (Flexible Layout Intermediate Representation), an MLIR-native compiler stack with a first-class layout IR and a composable lowering pipeline to GPU/ROCDL. FlyDSL gives developers who know the CUTLASS and CuTe DSLs a familiar path to AMD hardware, offers a Python-based alternative to template-heavy HIP C++, and complements Triton by exposing thread-level and IR-level control aimed at roofline performance. It supports core AI operators such as Softmax, LayerNorm, Quantization, GEMM, and Mixture of Experts (MoE) kernels, and has seen early production adoption for large-scale inference workloads.

Key takeaway

For NLP engineers optimizing large-scale LLM workloads on AMD GPUs, FlyDSL offers a direct path toward roofline performance. Its Python-first approach and explicit thread-level control let you tune kernels below Triton's level of abstraction, with shorter iteration cycles and more predictable lowering. Consider FlyDSL when developing or porting high-performance operators such as FlashAttention or custom GEMM kernels to the ROCm ecosystem.
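
To make "roofline performance" concrete, here is a minimal sketch of the standard roofline bound: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. The hardware figures below are hypothetical placeholders, not the specification of any AMD GPU, and the GEMM traffic estimate assumes each operand is read or written exactly once.

```python
def attainable_flops(peak_flops, mem_bandwidth, intensity):
    """Roofline bound: min(compute roof, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, mem_bandwidth * intensity)


def gemm_intensity(m, n, k, bytes_per_elem):
    """Arithmetic intensity (FLOPs per byte) of C = A @ B.

    Assumes ideal reuse: A, B, and C each cross the memory bus once.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


# Hypothetical device: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
MEM_BANDWIDTH = 1e12

large = gemm_intensity(1024, 1024, 1024, bytes_per_elem=2)
skinny = gemm_intensity(1, 1024, 1024, bytes_per_elem=2)

large_bound = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, large)
skinny_bound = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, skinny)

print(f"large GEMM:  {large:.1f} FLOP/B, bound {large_bound:.3e} FLOP/s")
print(f"skinny GEMM: {skinny:.2f} FLOP/B, bound {skinny_bound:.3e} FLOP/s")
```

Under these assumed numbers the square GEMM sits on the compute roof while the matrix-vector-shaped case is bandwidth bound, which is why the two call for different tiling and layout choices.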

Key insights

FlyDSL is a Python-first, MLIR-native DSL for high-performance GPU kernel development on AMD GPUs.

Principles

Explicit layouts and tiling are crucial for performance.
Python DSLs simplify kernel authoring and iteration.
MLIR-native compilation ensures predictable lowering.
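
As a rough illustration of what an explicit layout means, the sketch below treats a layout as a (shape, stride) pair that maps a logical coordinate to a linear memory offset, in the spirit of CuTe-style layout algebra. It is plain Python with hypothetical helper names (`layout_offset`, `layout_offsets`, `tile_offsets`); it is not the FlyDSL or CuTe API, and the enumeration order is a choice made for this example.

```python
from itertools import product


def layout_offset(coord, stride):
    """Linear offset of a logical coordinate: sum of coord[i] * stride[i]."""
    return sum(c * s for c, s in zip(coord, stride))


def layout_offsets(shape, stride):
    """Offsets for every coordinate of `shape`, last index varying fastest."""
    coords = product(*(range(extent) for extent in shape))
    return [layout_offset(c, stride) for c in coords]


# The same logical 2x3 tensor under two different layouts.
row_major = layout_offsets((2, 3), (3, 1))
col_major = layout_offsets((2, 3), (1, 2))

# Tiling a row-major 4x4 matrix into 2x2 tiles is itself a layout over
# (tile_row, tile_col, row_in_tile, col_in_tile).
TILED_SHAPE = (2, 2, 2, 2)
TILED_STRIDE = (8, 2, 4, 1)


def tile_offsets(tile_row, tile_col):
    """Memory offsets touched by one 2x2 tile of the 4x4 matrix."""
    return [
        layout_offset((tile_row, tile_col, i, j), TILED_STRIDE)
        for i in range(TILED_SHAPE[2])
        for j in range(TILED_SHAPE[3])
    ]


print(row_major)           # [0, 1, 2, 3, 4, 5]
print(col_major)           # [0, 2, 4, 1, 3, 5]
print(tile_offsets(0, 1))  # [2, 3, 6, 7]
```

Changing only the stride tuple changes the memory access pattern without touching the logical indexing, which is the property that makes layouts worth expressing explicitly.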

Method

FlyDSL uses a Python DSL whose AST transforms convert Python control flow into MLIR, followed by JIT-friendly compilation through an explicit MLIR → ROCDL → HSACO lowering pipeline.
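
The general idea of turning Python control flow into structured IR can be sketched with the standard-library `ast` module. The toy visitor below walks a function's syntax tree and prints MLIR-flavoured text for `for` and `if` statements. It is an assumption-laden illustration, not FlyDSL's implementation: the class name `ControlFlowLowering`, the `lower` helper, and the emitted text are invented here, the output is not valid MLIR, and only single-argument `range(...)` loops are handled.

```python
import ast
import textwrap

SOURCE = textwrap.dedent(
    """
    def kernel(n):
        for i in range(n):
            if i < 4:
                touch(i)
    """
)


class ControlFlowLowering(ast.NodeVisitor):
    """Toy lowering of Python control flow to MLIR-like text (illustrative only)."""

    def __init__(self):
        self.lines = []
        self.depth = 0

    def emit(self, text):
        self.lines.append("  " * self.depth + text)

    def visit_block(self, statements):
        self.depth += 1
        for stmt in statements:
            self.visit(stmt)
        self.depth -= 1

    def visit_For(self, node):
        # Assumes the loop has the form `for <name> in range(<bound>)`.
        bound = ast.unparse(node.iter.args[0])
        self.emit(f"scf.for %{node.target.id} = 0 to %{bound} step 1 {{")
        self.visit_block(node.body)
        self.emit("}")

    def visit_If(self, node):
        self.emit(f"scf.if ({ast.unparse(node.test)}) {{")
        self.visit_block(node.body)
        self.emit("}")

    def visit_Expr(self, node):
        # Plain expression statements are kept as comments in this sketch.
        self.emit(f"// {ast.unparse(node.value)}")


def lower(source):
    """Parse Python source and return the emitted pseudo-IR lines."""
    lowering = ControlFlowLowering()
    lowering.visit(ast.parse(source))
    return lowering.lines


lowered = lower(SOURCE)
print("\n".join(lowered))
```

A real DSL front end would build IR operations through MLIR's builders rather than strings, but the shape of the problem is the same: Python syntax is inspected before execution so loops and branches become structured operations instead of being unrolled by the interpreter.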

In practice

Migrate CUTLASS/CuTe DSL kernels to AMD with FlyDSL.
Develop custom GEMM/attention kernels for ROCm.
Optimize AI operators like Softmax and LayerNorm.

Topics

FlyDSL
GPU Kernel Development
MLIR
AMD ROCm
CuTe Layout Algebra

Code references

ROCm/FlyDSL

Best for: NLP Engineer, AI Engineer, Machine Learning Engineer, Software Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.