Inside the Together AI kernels team

2026-06-09 · Source: Together AI | The AI Native Cloud - Together.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Expert, medium

Summary

The Together AI kernels team, founded by Dan Fu and Tri Dao, grew out of their 2022 FlashAttention work, which achieved 2-3x speedups in transformer attention by optimizing how data moves through GPU memory. That result challenged the conventional wisdom that GPU performance was already fully optimized. The team focuses on closing the "software layer" gap between AI models and hardware, which is critical for AI-native applications. In March 2025, the 15-person team used its ThunderKittens library to develop optimized FP4 and FP8 GEMM kernels for NVIDIA's new Blackwell GPUs within a week, achieving up to 2x speedups over cuBLAS on H100s. The team's academic-industry collaboration model, which involves researchers from UCSD, Princeton, and Caltech, also produced Together Megakernel. It reduced time-to-first-64-tokens for a real-time voice agent on Llama-3.2-1B from 281ms to 77ms, roughly a 3.6x improvement. The team offers custom kernel optimization to strategic partners with tight SLAs.
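
As a quick check on the quoted latency figures, the short calculation below derives the speedup and the relative latency reduction from the two numbers given in the summary. The variable names are illustrative only; nothing here comes from Together AI's code.

```python
# Figures quoted in the summary (time-to-first-64-tokens, Llama-3.2-1B voice agent).
baseline_ms = 281.0   # before Together Megakernel
optimized_ms = 77.0   # with Together Megakernel

speedup = baseline_ms / optimized_ms              # how many times faster
latency_cut = 1.0 - optimized_ms / baseline_ms    # fraction of latency removed

print(f"speedup: {speedup:.2f}x")
print(f"latency reduction: {latency_cut:.1%}")
```

The ratio comes out at about 3.65, consistent with the summary's 3.6x figure, and corresponds to removing roughly 73% of the original latency.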

Key takeaway

MLOps engineers and AI infrastructure architects optimizing real-time AI applications should recognize that generic infrastructure often falls short of tight latency targets. Consider engaging a specialized kernel optimization team, such as the one at Together AI, to hit those targets and improve unit economics. Custom kernels can deliver large gains: Together Megakernel cut time-to-first-64-tokens from 281ms to 77ms, which directly affects user experience and operating cost.

Key insights

GPU kernel optimization, exemplified by FlashAttention, offers substantial performance gains for AI workloads.
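
To make the FlashAttention idea concrete, here is a minimal pure-Python sketch of tiled attention with a running ("online") softmax for a single query. It is an illustration of the general technique under simplifying assumptions (one query, no scaling factor, plain lists, no GPU), not the FlashAttention kernel itself; the function names and the `tile` parameter are hypothetical. The point it shows is that keys and values can be streamed in small tiles while only a running maximum, a running sum, and an accumulator are kept, so the full score vector never has to be materialized.

```python
import math

def attention_reference(q, keys, values):
    """Standard softmax attention for one query: materializes every score."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def attention_tiled(q, keys, values, tile=2):
    """Same result, but streams K/V tile by tile with running statistics."""
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalized weighted sum of values
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[0.1, 0.2], [0.3, -0.4], [1.0, 1.0], [-0.5, 0.7], [0.2, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [0.3, 0.9]]
ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, tile=2)
```

Both functions return the same output up to floating-point rounding. On a GPU, the tiled form matters because each tile can stay in fast on-chip memory, which reduces traffic to slower device memory.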

Principles

Data locality and memory hierarchies are key to GPU optimization.
Academic-industry collaboration accelerates kernel development.
Custom kernels are essential for AI-native application performance.

Method

ThunderKittens provides abstractions over NVIDIA's tensor cores that shrink a kernel from 1,000+ lines of CUDA to 100-200 lines. The smaller code base lets the team adapt kernels quickly to new hardware generations such as Blackwell GPUs.

In practice

Apply database principles, such as data locality and awareness of the memory hierarchy, to optimize GPU memory access.
Explore custom kernel solutions for latency-sensitive AI apps.
Investigate ThunderKittens for faster kernel development.
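
One way to see the database analogy is blocking, the idea behind a block nested-loop join: stage a block of data in fast memory and reuse it as much as possible before fetching the next one. The sketch below is a toy cost model in pure Python, not GPU code and not anything from ThunderKittens. It multiplies two square matrices naively and in tiles and counts how many elements each version fetches from "slow memory". The counting rule (every operand read in the naive version is a slow-memory load, every staged tile element in the tiled version is one load) is an assumption made for illustration.

```python
def matmul_naive(a, b):
    """Row-by-column product; every operand read counts as a slow-memory load."""
    n = len(a)
    loads = 0
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(n):
                total += a[i][k] * b[k][j]
                loads += 2
            c[i][j] = total
    return c, loads

def matmul_tiled(a, b, tile):
    """Blocked product; each staged tile element counts as one slow-memory load."""
    n = len(a)
    loads = 0
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Stage one tile of A and one tile of B in "fast memory".
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                loads += sum(len(r) for r in a_tile) + sum(len(r) for r in b_tile)
                # Reuse the staged tiles for every output element they touch.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        c[i0 + i][j0 + j] += sum(
                            a_row[k] * b_tile[k][j] for k in range(len(b_tile))
                        )
    return c, loads

n, tile = 8, 4
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]
c_naive, naive_loads = matmul_naive(a, b)
c_tiled, tiled_loads = matmul_tiled(a, b, tile)
print(naive_loads, tiled_loads)  # 1024 256
```

Under this model the tiled version fetches a factor of `tile` fewer elements (2n³/tile instead of 2n³) while producing the same result. Real kernels apply the same reasoning to shared memory and registers.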

Topics

GPU Kernel Optimization
FlashAttention
ThunderKittens
NVIDIA Blackwell GPUs
AI Native Cloud
Real-time AI

Best for: AI Engineer, Machine Learning Engineer, NLP Engineer, AI Scientist, AI Hardware Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Together AI | The AI Native Cloud - Together.ai.