Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels
Summary
Nautilus is a new tensor compiler for fully automated math-to-kernel optimization: it translates high-level algebraic specifications of tensor operators into efficient tiled GPU kernels. Its successive lowering architecture lets high-level optimizations, expression rewrites, and tile optimizations be applied jointly within a single system. A novel auto-scheduler identifies sequences of high-level optimizations, including aggressive global transformations such as advanced reduction fusion, while maintaining the program structure required by tile optimizers. Nautilus is presented as the first end-to-end system capable of generating FlashAttention-3-like kernels from a mathematical description of attention, automating the entire optimization process. In benchmarks on NVIDIA GH200 and RTX 5090 GPUs across five transformer models and 150 configurations, Nautilus achieves up to 23% higher throughput on GH200 and up to 42% higher throughput on RTX 5090 than other compilers, and it often matches or surpasses manually written cuDNN kernels on long-sequence configurations.
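For orientation, the "mathematical description of attention" mentioned above can be written as the standard scaled dot-product attention formula. This is the textbook definition, not Nautilus's input syntax, which the summary does not show; the symbols Q, K, V (query, key, and value matrices with N rows and head dimension d) are the usual conventions and are assumed here.

```latex
S = \frac{Q K^{\top}}{\sqrt{d}}, \qquad
P_{ij} = \frac{\exp(S_{ij})}{\sum_{k=1}^{N} \exp(S_{ik})}, \qquad
O = P V
```

Evaluated literally, this involves several reductions (the contraction over d in S, the row-wise sum in the softmax, and the contraction over keys in O), and it materializes the full N-by-N matrices S and P. Fusing those reductions so that S and P never need to be stored in full is the kind of global transformation a FlashAttention-style kernel depends on.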
Key takeaway
For AI Engineers and Research Scientists optimizing transformer-based models, Nautilus offers a significant advancement by automating the generation of highly efficient GPU kernels. Consider evaluating Nautilus to reduce manual optimization effort: in the reported benchmarks it achieved up to 23% higher throughput on NVIDIA GH200 and up to 42% higher throughput on RTX 5090 than other compilers, and it often matched or surpassed manually written cuDNN kernels on long-sequence configurations.
Key insights
Nautilus automates tensor kernel optimization from math specifications and, in the reported benchmarks, outperforms other compilers by up to 23% on GH200 and up to 42% on RTX 5090.
Principles
- Jointly apply high-level and tile optimizations.
- Auto-schedule high-level transformations.
- Preserve program structure for tile optimizers.
Method
Nautilus lowers algebraic tensor specifications to tiled GPU kernels in successive stages. Its auto-scheduler searches for sequences of high-level optimizations, including aggressive global transformations such as advanced reduction fusion, while preserving the program structure that tile optimizers require.
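To make "reduction fusion" concrete, here is a minimal sketch in plain Python of the well-known online-softmax idea behind FlashAttention-style kernels. It is not Nautilus output and not its algorithm; the function names and the `tile` parameter are hypothetical, chosen only for illustration. The first function computes attention for one query with separate reductions (max, sum, weighted sum); the second fuses them into a single pass over key/value tiles by keeping a running maximum, a running sum, and a rescaled accumulator.

```python
import math


def attention_reference(q, keys, values):
    """Unfused attention for one query: separate max, sum, and weighted-sum reductions."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)                                  # reduction 1: row maximum
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)                             # reduction 2: softmax denominator
    dim = len(values[0])
    # reduction 3: weighted sum of the value vectors
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def attention_fused(q, keys, values, tile=2):
    """Fused attention for one query: one pass over key/value tiles (hypothetical sketch)."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale the partial results so they are relative to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same result up to floating-point rounding, but the fused version never holds the full score vector, which is what lets a tiled GPU kernel keep its working set in fast on-chip memory. A real kernel would process many queries per tile and use matrix units; this sketch only shows the algebra of the fusion.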
In practice
- Generate FlashAttention-3-like kernels automatically.
- Achieve higher throughput on NVIDIA GH200 and RTX 5090.
- Match or exceed cuDNN for long sequences.
Topics
- Nautilus Compiler
- Auto-Scheduling
- Tensor Compilers
- Tiled GPU Kernels
- High-Level Optimizations
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.