Flash Attention Mechanics: How Tiled Attention Fits in SRAM
Summary
FlashAttention, a kernel-level rewrite of self-attention introduced by Dao et al. (2022), avoids materializing the full N×N attention score matrix. For a 4096-token sequence, standard attention stores a 1.0 GB FP16 score matrix and incurs more than 4 GB of HBM I/O. (A single 4096×4096 FP16 matrix is 32 MiB, so the 1.0 GB figure implies a total across heads; 32 heads at batch size 1 would give exactly 1 GiB.) FlashAttention never writes this matrix to HBM. Instead, it computes attention in tiles that fit within approximately 129 KB of per-SM SRAM on an A100 GPU. The gain comes from I/O rather than compute: FLOPs are unchanged, while HBM traffic drops by about 33× at 4K tokens and 129× at 16K tokens. Total attention memory also falls by approximately 9× at 4K tokens.
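The score-matrix size quoted above can be checked with a few lines of arithmetic. This is a minimal sketch: the function name and parameters are illustrative, and the head count of 32 at batch size 1 is an assumption chosen because it reproduces the 1.0 GB figure, not a value stated in the source.

```python
def score_matrix_bytes(seq_len, bytes_per_element=2, num_heads=1, batch_size=1):
    """Bytes needed to store the full seq_len x seq_len attention score matrix."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element

MIB = 1024 ** 2
GIB = 1024 ** 3

# One head, FP16 (2 bytes per element).
per_head = score_matrix_bytes(4096)

# ASSUMPTION: 32 heads at batch size 1; the source does not state the head count.
all_heads = score_matrix_bytes(4096, num_heads=32)

print(per_head / MIB)   # 32.0
print(all_heads / GIB)  # 1.0
```

Because the size grows with the square of the sequence length, moving from 4K to 16K tokens multiplies it by 16.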
Key takeaway
Machine learning engineers optimizing large language models for long sequences should consider integrating FlashAttention. It reduces attention HBM traffic by about 33× at 4K tokens and total attention memory by about 9×, without increasing FLOPs. This can significantly improve training and inference efficiency, especially on memory-constrained hardware such as A100 GPUs.
Key insights
FlashAttention speeds up self-attention by never materializing the full score matrix in HBM: it computes attention in tiles that fit in SRAM, which cuts HBM traffic.
Principles
- N×N attention matrices dominate memory and bandwidth for long sequences.
- I/O-bound operations benefit significantly from on-chip memory utilization.
Method
FlashAttention employs a kernel-level rewrite to process attention in tiles that fit within SRAM, eliminating the need to write the full attention score matrix to HBM.
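Tiling works because softmax can be computed incrementally: for each query row, a running maximum and a running sum of exponentials are enough to rescale partial results as new key/value blocks arrive. The sketch below illustrates that idea in plain Python and checks it against a reference that builds every score row explicitly. It is not the actual GPU kernel. The function names and `block_size` are illustrative, and for clarity it tiles only the keys and values, whereas a real kernel also tiles the queries and keeps the blocks in on-chip memory.

```python
import math
import random

def naive_attention(Q, K, V):
    """Reference: builds every row of the N x N score matrix explicitly."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block_size=4):
    """Visits K/V one block at a time, keeping only a running max, a running
    sum, and an unnormalized output per query row (online softmax)."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        m = float("-inf")   # running max of the scores seen so far
        l = 0.0             # running sum of exp(score - m)
        acc = [0.0] * dv    # running unnormalized output row
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            m_new = max(m, max(scores))
            correction = math.exp(m - m_new)  # rescales old state to the new max
            weights = [math.exp(s - m_new) for s in scores]
            l = l * correction + sum(weights)
            acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out

# Small random inputs; N is not a multiple of block_size, so the last block is short.
random.seed(0)
N, d = 10, 8
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]

ref = naive_attention(Q, K, V)
got = tiled_attention(Q, K, V, block_size=4)
max_err = max(abs(a - b) for r, g in zip(ref, got) for a, b in zip(r, g))
print(max_err < 1e-9)  # True
```

The tiled version never holds more than `block_size` scores per query at once, which is the property that lets a kernel keep its working set on chip.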
In practice
- Reduce HBM traffic for attention by about 33× at 4K tokens.
- Achieve ~9× total attention memory reduction at 4K tokens.
- Utilize ~129 KB per-SM SRAM for attention computations.
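To get a feel for how a tile working set relates to an SRAM figure like the one above, the sketch below adds up the bytes held on chip for one tile step. Everything in it is an assumption for illustration: the source does not give its tile shapes, so the 128×128 tiles with head dimension 128 are hypothetical values that happen to total 129 KB. The accounting is also simplified, since it covers only the Q, K, V, and score tiles plus per-row statistics and leaves out the output accumulator. The 164 KB budget is likewise an assumed per-SM shared-memory limit for the A100.

```python
def tile_footprint_bytes(block_rows, block_cols, head_dim,
                         elem_bytes=2, stat_bytes=4):
    """Approximate on-chip bytes for one tile step (simplified accounting)."""
    q_tile = block_rows * head_dim * elem_bytes    # query block
    k_tile = block_cols * head_dim * elem_bytes    # key block
    v_tile = block_cols * head_dim * elem_bytes    # value block
    s_tile = block_rows * block_cols * elem_bytes  # block of attention scores
    stats = 2 * block_rows * stat_bytes            # running max and sum per row
    return q_tile + k_tile + v_tile + s_tile + stats

KIB = 1024

# ASSUMPTION: hypothetical tile shapes, not taken from the source.
footprint = tile_footprint_bytes(block_rows=128, block_cols=128, head_dim=128)
print(footprint / KIB)  # 129.0

# ASSUMPTION: 164 KB of shared memory available per SM.
sram_budget = 164 * KIB
print(footprint <= sram_budget)  # True
```

Changing the block sizes or head dimension shows how quickly the working set grows, which is why tile shapes are chosen against the SRAM budget.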
Topics
- FlashAttention
- Self-Attention
- GPU Optimization
- SRAM
- HBM Traffic Reduction
- Kernel Rewrites
Best for: AI Engineer, NLP Engineer, Research Scientist, Machine Learning Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.