AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis

2026-06-08 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

AutoMegaKernel (AMK) compiles HuggingFace Llama-family models into a single persistent cooperative CUDA kernel that executes the entire forward pass in one launch, with no per-model hand-written CUDA. A frozen schedule-IR validator statically certifies deadlock-freedom and race-freedom via graph checks: it rejects 6,091 unsafe adversarial schedules with zero false accepts and accepts all 360 real lowerings. AMK retargets sm_80, sm_90, and sm_120 from one codebase and auto-generates correct megakernels for 10 supported models. It reproduces HuggingFace greedy decode token-for-token on SmolLM2-135M (perplexity match 2.5e-7), and an agent-drivable autoresearch loop improves its own megakernel by 1.25-1.72x. An int8 (W8A16) megakernel beats CUDA-graphed cuBLAS bf16 at batch-1 decode on inference-class NVIDIA GPUs: up to 1.33x on L4, 1.25-1.27x on L40S, 1.08x on A10G, and 1.19-1.23x on RTX 5090. It trails cuBLAS on A100 and H100.

Key takeaway

For machine learning engineers optimizing Llama-family inference, AutoMegaKernel is an alternative to CUDA-graphed cuBLAS on NVIDIA's inference-class GPUs. On L4, L40S, A10G, or RTX 5090, its int8 (W8A16) megakernel delivers speedups of 1.08x to 1.33x over cuBLAS bf16 at batch-1 decode. The system also provides statically checked schedule safety (deadlock-freedom and race-freedom), an agent-drivable loop that improves kernel performance, and retargeting across sm_80, sm_90, and sm_120 from one codebase. AMK currently trails cuBLAS on high-bandwidth training-class GPUs such as A100 and H100.

Key insights

AutoMegaKernel synthesizes statically checked, retargetable megakernels for Llama-family models; its int8 (W8A16) variant outperforms CUDA-graphed cuBLAS bf16 at batch-1 decode on inference-class GPUs.

Principles

Statically certifying kernel schedules rules out deadlocks and data races before launch.
Agent-driven autoresearch can self-improve kernel performance.
A single codebase can retarget multiple GPU architectures (sm_80, sm_90, sm_120).
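
The first principle can be made concrete with a toy graph check. This summary does not describe AMK's schedule IR or its validator internals, so everything below is an assumption made for illustration: the task/edge encoding, the buffer names, and the functions `has_cycle`, `ordered_after`, and `validate` are hypothetical. The sketch treats a cyclic wait as a potential deadlock and treats two tasks that touch the same buffer (at least one writing) with no dependency path between them as a potential race.

```python
from collections import defaultdict


def has_cycle(num_tasks, deps):
    """Kahn's algorithm: True if the wait-for graph contains a cycle."""
    indegree = [0] * num_tasks
    succ = defaultdict(list)
    for producer, consumer in deps:  # consumer waits on producer
        succ[producer].append(consumer)
        indegree[consumer] += 1
    ready = [t for t in range(num_tasks) if indegree[t] == 0]
    visited = 0
    while ready:
        task = ready.pop()
        visited += 1
        for nxt in succ[task]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return visited != num_tasks  # unvisited tasks sit on a cycle


def ordered_after(num_tasks, deps):
    """reach[a] is the set of tasks guaranteed to run after task a."""
    succ = defaultdict(list)
    for producer, consumer in deps:
        succ[producer].append(consumer)
    reach = []
    for start in range(num_tasks):
        seen, stack = set(), [start]
        while stack:
            for nxt in succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        reach.append(seen)
    return reach


def validate(num_tasks, deps, reads, writes):
    """Return a list of error strings; an empty list means 'accepted'."""
    if has_cycle(num_tasks, deps):
        return ["deadlock: cyclic wait"]
    reach = ordered_after(num_tasks, deps)
    errors = []
    for a in range(num_tasks):
        for b in range(a + 1, num_tasks):
            conflict = (writes[a] & (reads[b] | writes[b])) | (writes[b] & reads[a])
            if conflict and b not in reach[a] and a not in reach[b]:
                errors.append(f"race: tasks {a} and {b} on {sorted(conflict)}")
    return errors


# Hypothetical three-task schedule: norm -> qkv projection -> attention.
reads = [{"x"}, {"h"}, {"qkv"}]
writes = [{"h"}, {"qkv"}, {"o"}]
print(validate(3, [(0, 1), (1, 2)], reads, writes))          # accepted
print(validate(3, [(1, 2)], reads, writes))                  # missing edge
print(validate(3, [(0, 1), (1, 2), (2, 0)], reads, writes))  # cyclic wait
```

Both checks are plain graph traversals over the schedule, so they run without launching a kernel. A production validator would need a far richer model of synchronization than this sketch.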

Method

AMK compiles Llama models into a single CUDA megakernel, validates schedules statically for safety, and uses an agent-drivable loop for performance optimization and architecture retargeting.
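
How a validator-gated optimization loop fits together can be sketched as a small hill climb. This is not AMK's autoresearch loop, whose search space and benchmarking are not described in this summary. The candidate encoding `(tile, stages)`, the safety rule in `is_safe`, and the synthetic cost in `measure` are all assumptions chosen so the example runs anywhere. The point it illustrates is the ordering: a candidate is benchmarked only after a fixed checker accepts it, and the search code never modifies that checker.

```python
HIDDEN = 576  # example hidden size, chosen only for illustration


def is_safe(cand):
    """Stand-in for a frozen validator; the search may call it but not edit it."""
    tile, stages = cand
    # Hypothetical rule: tile must divide the hidden size and fit a resource budget.
    return HIDDEN % tile == 0 and tile * stages <= 512


def measure(cand):
    """Stand-in for an on-device benchmark; synthetic latency, lower is better."""
    tile, stages = cand
    return abs(tile - 96) + 10.0 / stages


def neighbours(cand):
    """Small edits to the current best candidate."""
    tile, stages = cand
    moves = [(tile + 32, stages), (tile - 32, stages),
             (tile, stages + 1), (tile, stages - 1)]
    return [(t, s) for t, s in moves if t > 0 and s >= 1]


def search(start, max_steps=50):
    """Greedy hill climb that only ever measures validator-approved candidates."""
    assert is_safe(start), "the starting schedule must itself be safe"
    best, best_cost, rejected = start, measure(start), 0
    for _ in range(max_steps):
        safe = []
        for cand in neighbours(best):
            if is_safe(cand):
                safe.append(cand)
            else:
                rejected += 1  # unsafe candidates are never benchmarked
        if not safe:
            break
        challenger = min(safe, key=measure)
        if measure(challenger) >= best_cost:
            break  # no safe neighbour improves on the current best
        best, best_cost = challenger, measure(challenger)
    return best, best_cost, rejected


best, cost, rejected = search((32, 1))
print(best, cost, rejected)
```

The latencies here are synthetic and say nothing about real hardware. In a real loop, `measure` would time the generated kernel on the target GPU and the proposals would come from an agent.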

In practice

Deploy Llama models on NVIDIA L4/L40S/RTX 5090 for faster inference.
Explore W8A16 quantization for batch-1 decode performance.
Use static schedule validation to rule out deadlocks and data races before launch.
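
For readers new to the W8A16 notation used above (8-bit weights, 16-bit activations), here is a minimal pure-Python sketch of symmetric per-row int8 weight quantization followed by a matrix-vector product. It is an illustration under stated assumptions, not AMK's kernel: the summary does not say which quantization scheme AMK uses, the helper names are hypothetical, and Python floats stand in for 16-bit activations.

```python
def quantize_rows(weight):
    """Symmetric per-row int8 quantization: w[i][j] ~= scales[i] * q[i][j]."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def matvec_w8a16(q_rows, scales, x):
    """y[i] = scales[i] * sum_j q[i][j] * x[j]; activations stay in floating point."""
    return [s * sum(q * xj for q, xj in zip(row, x))
            for row, s in zip(q_rows, scales)]


def weight_bytes(rows, cols, bytes_per_weight):
    """Bytes needed to store one weight matrix (scales excluded)."""
    return rows * cols * bytes_per_weight


weight = [[0.5, -1.0, 0.25],
          [2.0, 0.0, -0.5]]
x = [1.0, 2.0, -1.0]

q_rows, scales = quantize_rows(weight)
y_quant = matvec_w8a16(q_rows, scales, x)
y_ref = [sum(w * xj for w, xj in zip(row, x)) for row in weight]
print(y_ref, y_quant)

# int8 weights take half the storage of bf16 weights for the same matrix.
print(weight_bytes(4096, 4096, 1), weight_bytes(4096, 4096, 2))
```

Each weight takes one byte instead of two, so every decoded token reads half as many weight bytes. That is one plausible contributor to the batch-1 speedups reported above, though the summary itself does not attribute them to a specific cause.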

Topics

AutoMegaKernel
Megakernel Synthesis
CUDA Kernels
Llama Models
LLM Inference
GPU Optimization

Code references

RightNow-AI/AutoMegaKernel

Best for: AI Engineer, NLP Engineer, Research Scientist, Machine Learning Engineer, AI Scientist, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.