Inside the Together AI kernels team
Summary
The Together AI kernels team, founded by Dan Fu and Tri Dao, grew out of their 2022 FlashAttention work, which achieved 2-3x speedups in transformer attention by optimizing GPU memory movement. That result challenged the conventional wisdom that GPU performance was already fully optimized. The team focuses on closing the "software layer" gap between AI models and hardware, which is critical for AI-native applications. In March 2025, the 15-person team used its ThunderKittens library to develop optimized FP4 and FP8 GEMM kernels for NVIDIA's new Blackwell GPUs within a week, achieving up to 2x speedups over cuBLAS on H100s. The team's academic-industry collaboration model, involving researchers from UCSD, Princeton, and Caltech, also produced Together Megakernel, which reduced time-to-first-64-tokens for a real-time voice agent from 281ms to 77ms on Llama-3.2-1B, a roughly 3.6x improvement. The team offers custom kernel optimization for strategic partners with tight SLAs.
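The Megakernel figure can be checked with simple arithmetic. The sketch below only restates the two latencies quoted in the summary (281ms and 77ms); the function names are illustrative and do not come from the original article.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    if baseline_ms <= 0 or optimized_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / optimized_ms


def latency_reduction_pct(baseline_ms: float, optimized_ms: float) -> float:
    """Return the latency reduction as a percentage of the baseline."""
    if baseline_ms <= 0 or optimized_ms <= 0:
        raise ValueError("latencies must be positive")
    return 100.0 * (baseline_ms - optimized_ms) / baseline_ms


# Time-to-first-64-tokens figures quoted in the summary.
megakernel_speedup = speedup(281.0, 77.0)
megakernel_reduction = latency_reduction_pct(281.0, 77.0)

print(f"{megakernel_speedup:.2f}x faster, {megakernel_reduction:.1f}% lower latency")
# prints "3.65x faster, 72.6% lower latency"
```

The ratio is about 3.65, which the summary reports as 3.6x.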
Key takeaway
MLOps engineers and AI infrastructure architects optimizing real-time AI applications should recognize that generic infrastructure often falls short of tight latency targets. Consider engaging a specialized kernel optimization team such as Together AI's to meet those targets and improve unit economics. Custom kernels like Together Megakernel can deliver significant gains, such as cutting time-to-first-64-tokens from 281ms to 77ms, which directly affects user experience and operational costs.
Key insights
GPU kernel optimization, exemplified by FlashAttention, offers substantial performance gains for AI workloads.
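To make the memory-movement idea concrete, here is a minimal pure-Python sketch of the tiling and running-softmax technique associated with FlashAttention. It is not the FlashAttention kernel itself, which is a fused GPU kernel; it only illustrates, for a single query vector, how attention can be computed one key/value block at a time without ever holding the full score vector. The function names and the `block_size` parameter are assumptions made for illustration.

```python
import math


def _scaled_dot(q, k):
    """Dot product of q and k, scaled by 1/sqrt(d) as in standard attention."""
    return sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))


def attention_reference(q, keys, values):
    """Standard attention for one query: materializes every score first."""
    scores = [_scaled_dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_tiled(q, keys, values, block_size=2):
    """Same result, but visits keys/values block by block.

    Only a running max, a running sum, and one accumulator vector are
    kept between blocks, so the full score vector is never stored.
    """
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [_scaled_dot(q, k) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max.
        # On the first block this factor is exp(-inf) == 0.0.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5],
        [-1.0, 0.5, 2.0], [0.2, 0.2, 0.2]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

out_reference = attention_reference(q, keys, values)
out_tiled = attention_tiled(q, keys, values, block_size=2)
print(out_reference)
print(out_tiled)
```

The two outputs agree up to floating-point rounding. On a GPU, the benefit comes from keeping each block in fast on-chip memory while it is processed, which this CPU sketch does not model.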
Principles
- Data locality and memory hierarchies are key to GPU optimization.
- Academic-industry collaboration accelerates kernel development.
- Custom kernels are essential for AI-native application performance.
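The data-locality principle can be illustrated with a toy cost model. The sketch below multiplies two small matrices twice, once naively and once in tiles, and counts how many elements are read from the input matrices, treating those reads as a stand-in for slow-memory traffic. The counting model, function names, and matrix sizes are assumptions for illustration; real GPU kernels stage tiles in shared memory and registers, which this sketch does not reproduce.

```python
def matmul_naive(a, b):
    """Multiply square matrices, reading both operands for every multiply."""
    n = len(a)
    loads = 0
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0
            for k in range(n):
                total += a[i][k] * b[k][j]
                loads += 2  # one element of a, one element of b
            c[i][j] = total
    return c, loads


def matmul_tiled(a, b, tile):
    """Multiply square matrices tile by tile.

    Each tile of a and b is copied once into local lists (standing in for
    fast memory) and then reused for every multiply inside the tile.
    """
    n = len(a)
    loads = 0
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                loads += sum(len(r) for r in a_tile)
                loads += sum(len(r) for r in b_tile)
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        c[i0 + i][j0 + j] += sum(
                            a_row[k] * b_tile[k][j] for k in range(len(b_tile))
                        )
    return c, loads


n = 8
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i - j for j in range(n)] for i in range(n)]

c_naive, naive_loads = matmul_naive(a, b)
c_tiled, tiled_loads = matmul_tiled(a, b, tile=4)

print(naive_loads, tiled_loads)  # prints "1024 256"
```

Under this model, the tiled version reads the inputs 4x less often for a tile width of 4, while producing the same result. That reuse of staged data is the kind of saving the principles above refer to.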
Method
ThunderKittens abstracts NVIDIA's tensor cores, cutting the CUDA code for a kernel from 1,000+ lines to 100-200 lines. This makes it possible to adapt kernels quickly to new hardware generations such as Blackwell GPUs.
In practice
- Apply database principles to optimize GPU memory access.
- Explore custom kernel solutions for latency-sensitive AI apps.
- Investigate ThunderKittens for faster kernel development.
Topics
- GPU Kernel Optimization
- FlashAttention
- ThunderKittens
- NVIDIA Blackwell GPUs
- AI Native Cloud
- Real-time AI
Best for: AI Engineer, Machine Learning Engineer, NLP Engineer, AI Scientist, AI Hardware Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Together AI | The AI Native Cloud - Together.ai.