FlyDSL: Expert GPU Kernel Development with the Ease of MLIR Python Native DSL on AMD GPUs
Summary
AMD has introduced FlyDSL (Flexible Layout Python DSL), a Python-first, MLIR-native DSL for expert-level GPU kernel development on AMD architectures. Released on February 20, 2026, FlyDSL is an open framework for authoring high-performance GPU kernels with explicit layouts and tiling. It is powered by FLIR (Flexible Layout Intermediate Representation), an MLIR-native compiler stack with a first-class layout IR and a composable lowering pipeline to GPU/ROCDL. FlyDSL gives developers familiar with the CUTLASS and CuTe DSLs a recognizable programming model, provides a Python-based alternative to template-heavy HIP C++, and complements Triton by exposing thread-level and IR-level control for kernels that need to reach roofline performance. It supports core AI operators such as Softmax, LayerNorm, Quantization, GEMM, and Mixture of Experts (MoE) kernels, with early production adoption for large-scale inference workloads.
Key takeaway
For NLP engineers optimizing large-scale LLM workloads on AMD GPUs, FlyDSL offers a direct path toward roofline performance. Its Python-first approach and explicit thread-level control let you tune kernels below Triton's level of abstraction while shortening iteration times and making lowering more predictable. Consider FlyDSL when developing or porting high-performance operators such as FlashAttention or custom GEMM kernels to the ROCm ecosystem.
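Since "roofline performance" is the stated target, it may help to recall what the roofline bound says: attainable throughput is the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity. The sketch below is plain Python and does not use FlyDSL. The function names are hypothetical, and the hardware figures in the usage lines are placeholders, not specifications of any AMD GPU.

```python
def attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Roofline bound: min(peak compute, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def gemm_arithmetic_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte for C = A @ B, assuming each matrix moves once."""
    flops = 2 * m * n * k  # one multiply and one add per (m, n, k) triple
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


if __name__ == "__main__":
    # Placeholder hardware numbers, chosen only to illustrate the formula.
    peak = 100e12       # FLOP/s (assumed)
    bandwidth = 1e12    # bytes/s (assumed)
    intensity = gemm_arithmetic_intensity(4096, 4096, 4096)
    bound = attainable_flops(peak, bandwidth, intensity)
    print(f"intensity = {intensity:.1f} FLOP/byte, bound = {bound:.3e} FLOP/s")
```

A kernel whose intensity falls left of the ridge point (peak divided by bandwidth) is memory-bound, which is where layout and tiling choices tend to matter most.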
Key insights
FlyDSL is a Python-first, MLIR-native DSL for high-performance GPU kernel development on AMD GPUs.
Principles
- Explicit layouts and tiling are crucial for performance.
- Python DSLs simplify kernel authoring and iteration.
- MLIR-native compilation ensures predictable lowering.
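To make "explicit layouts and tiling" concrete, here is a minimal sketch of the shape/stride idea behind CuTe-style layouts: a layout maps a logical coordinate to a linear memory offset as the dot product of the coordinate with the strides. This is plain Python with hypothetical helper names; it is not FlyDSL or FLIR syntax, and it only covers flat two-dimensional layouts.

```python
def layout_offset(coord, stride):
    """Map a logical coordinate to a linear offset: sum(coord_i * stride_i)."""
    return sum(c * s for c, s in zip(coord, stride))


def tile_origins(shape, tile_shape, stride):
    """Offsets of each tile's top-left element for a 2D tensor.

    Assumes the tile shape divides the tensor shape evenly.
    """
    rows, cols = shape
    tile_rows, tile_cols = tile_shape
    return [
        layout_offset((r, c), stride)
        for r in range(0, rows, tile_rows)
        for c in range(0, cols, tile_cols)
    ]


if __name__ == "__main__":
    shape = (4, 8)
    row_major = (8, 1)  # stepping one row skips 8 elements
    col_major = (1, 4)  # stepping one column skips 4 elements
    print(layout_offset((1, 2), row_major))
    print(layout_offset((1, 2), col_major))
    print(tile_origins(shape, (2, 4), row_major))
```

The same logical coordinate lands at different addresses under the two stride choices, which is why making the layout explicit lets a kernel author reason directly about memory access patterns.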
Method
FlyDSL uses a Python DSL with AST transforms that convert Python control flow into MLIR, followed by JIT-friendly compilation through a clear MLIR → ROCDL → HSACO lowering pipeline.
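The general idea of an AST transform can be sketched with Python's standard `ast` module: parse the kernel source, walk the tree, and map `for` and `if` statements to structured control-flow operations. The code below is an illustration only and is not FlyDSL's implementation. It emits text labels rather than real MLIR, the class and function names are hypothetical, and the `scf.for` / `scf.if` labels borrow names from MLIR's SCF dialect on the assumption that a structured dialect of that kind is the target.

```python
import ast
import textwrap


class ControlFlowLowering(ast.NodeVisitor):
    """Walk a Python AST and record an indented list of pseudo-IR labels."""

    def __init__(self):
        self.ops = []
        self.depth = 0

    def _emit(self, text):
        self.ops.append("  " * self.depth + text)

    def _visit_block(self, statements):
        self.depth += 1
        for stmt in statements:
            self.visit(stmt)
        self.depth -= 1

    def visit_For(self, node):
        target = ast.unparse(node.target)
        iterable = ast.unparse(node.iter)
        self._emit(f"scf.for {target} in {iterable}")
        self._visit_block(node.body)

    def visit_If(self, node):
        self._emit(f"scf.if {ast.unparse(node.test)}")
        self._visit_block(node.body)
        if node.orelse:
            self._emit("else")
            self._visit_block(node.orelse)

    def visit_Assign(self, node):
        self._emit(f"op: {ast.unparse(node)}")

    def visit_AugAssign(self, node):
        self._emit(f"op: {ast.unparse(node)}")


def lower_control_flow(source):
    """Return the pseudo-IR labels for the given Python source string."""
    tree = ast.parse(textwrap.dedent(source))
    visitor = ControlFlowLowering()
    visitor.visit(tree)
    return visitor.ops


SAMPLE = """
def kernel(n, x):
    acc = 0
    for i in range(n):
        if x > 0:
            acc += i
    return acc
"""

if __name__ == "__main__":
    print("\n".join(lower_control_flow(SAMPLE)))
```

A real frontend would build IR operations with typed values and regions instead of strings, but the traversal pattern (one visitor method per control-flow construct) is the same shape.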
In practice
- Migrate CUTLASS/CuTe DSL kernels to AMD with FlyDSL.
- Develop custom GEMM/attention kernels for ROCm.
- Optimize AI operators like Softmax and LayerNorm.
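When developing or optimizing operators like these, a small host-side reference is useful for checking a custom kernel's output. The sketch below gives plain-Python reference versions of row-wise Softmax and LayerNorm (without learned scale or bias) plus an error metric. The function names are hypothetical and nothing here calls FlyDSL; the kernel output to compare against is assumed to come from elsewhere.

```python
import math


def softmax_reference(row):
    """Numerically stable softmax: subtract the row max before exponentiating."""
    row_max = max(row)
    exps = [math.exp(v - row_max) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layernorm_reference(row, eps=1e-5):
    """LayerNorm over one row, without learned scale or bias."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in row]


def max_abs_error(expected, actual):
    """Largest element-wise absolute difference between two rows."""
    return max(abs(e - a) for e, a in zip(expected, actual))


if __name__ == "__main__":
    row = [1.0, 2.0, 3.0, 4.0]
    expected = softmax_reference(row)
    # Stand-in for a kernel result; a real check would copy output back from the GPU.
    kernel_output = list(expected)
    print(max_abs_error(expected, kernel_output))
```

Subtracting the row maximum keeps the exponentials from overflowing on large inputs, which is the same trick optimized Softmax kernels typically apply per tile.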
Topics
- FlyDSL
- GPU Kernel Development
- MLIR
- AMD ROCm
- CuTe Layout Algebra
Best for: NLP Engineer, AI Engineer, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.