Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels

2026-04-16 · Source: Machine Learning · Field: Technology & Digital — Software Development & Engineering, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Nautilus is a tensor compiler for fully automated math-to-kernel optimization: it translates high-level algebraic specifications of tensor operators into efficient tiled GPU kernels. Its successive lowering architecture lets high-level optimizations, expression rewrites, and tile optimizations be applied jointly within a single system. A novel auto-scheduler identifies sequences of high-level optimizations, including aggressive global transformations such as advanced reduction fusion, while preserving the program structure that tile optimizers require. Nautilus is presented as the first end-to-end system able to generate FlashAttention-3-like kernels from a mathematical description of attention. In benchmarks on NVIDIA GH200 and RTX 5090 GPUs across five transformer models and 150 configurations, Nautilus achieves up to 23% higher throughput on GH200 and up to 42% higher on RTX 5090 than other compilers, and often matches or surpasses manually written cuDNN kernels on long-sequence tasks.

Key takeaway

For AI engineers and research scientists optimizing transformer-based models, Nautilus automates the generation of efficient GPU kernels from mathematical operator specifications. Consider evaluating it to reduce manual optimization effort, especially for long-sequence configurations, where it often matches or surpasses cuDNN. Against other compilers, it showed up to 23% higher throughput on NVIDIA GH200 and up to 42% higher on RTX 5090.

Key insights

Nautilus automates tensor kernel optimization from math specifications, outperforming existing compilers.

Principles

Jointly apply high-level and tile optimizations.
Auto-schedule high-level transformations.
Preserve program structure for tile optimizers.

Method

Nautilus uses successive lowering and an auto-scheduler to discover optimization sequences, translating algebraic tensor specifications into tiled GPU kernels while handling complex interactions like reduction fusion.
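To make the idea of reduction fusion concrete, here is a minimal sketch in plain Python. It is not Nautilus code or its API. It only illustrates the online-softmax rewrite that FlashAttention-style kernels rely on, and that a compiler of this kind would need to discover automatically. The function names and the tile parameter are hypothetical, chosen for illustration.

```python
import math


def attention_reference(q, k, v):
    """softmax(q k^T / sqrt(d)) v, written as three separate reductions per query row."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)                               # reduction 1: row maximum
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)                          # reduction 2: normalizer
        out.append([                                  # reduction 3: weighted sum of values
            sum(w * vj[c] for w, vj in zip(weights, v)) / total
            for c in range(len(v[0]))
        ])
    return out


def attention_tiled(q, k, v, tile=2):
    """Same result, with the three reductions fused into one pass over key/value tiles."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        m = -math.inf               # running maximum of the scores seen so far
        total = 0.0                 # running sum of exp(score - m)
        acc = [0.0] * len(v[0])     # running unnormalized output row
        for start in range(0, len(k), tile):
            k_tile = k[start:start + tile]
            v_tile = v[start:start + tile]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_tile]
            m_new = max(m, max(scores))
            # Rescale earlier partial results to the new maximum.
            # On the first tile this is exp(-inf) == 0.0, which is harmless.
            correction = math.exp(m - m_new)
            weights = [math.exp(s - m_new) for s in scores]
            total = total * correction + sum(weights)
            acc = [
                a * correction + sum(w * vj[c] for w, vj in zip(weights, v_tile))
                for c, a in enumerate(acc)
            ]
            m = m_new
        out.append([a / total for a in acc])
    return out
```

The tiled version never materializes a full row of scores, so each key/value tile can stay in fast on-chip memory while it is processed. The sketch covers only the algebraic rewrite. Tile-size selection, memory placement, and hardware-specific scheduling, which the article attributes to the tile optimization stage, are left out.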

In practice

Generate FlashAttention-3-like kernels automatically.
Achieve up to 23% higher throughput than other compilers on NVIDIA GH200 and up to 42% higher on RTX 5090.
Often match or exceed manually written cuDNN kernels on long-sequence configurations.

Topics

Nautilus Compiler
Auto-Scheduling
Tensor Compilers
Tiled GPU Kernels
High-Level Optimizations

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.