Flash Attention Mechanics: How Tiled Attention Fits in SRAM

· Source: Towards AI - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

FlashAttention, a kernel-level rewrite introduced by Dao et al. (2022), speeds up self-attention by never materializing the full N×N attention score matrix in HBM. For a 4096-token sequence, standard attention stores about 1.0 GB of FP16 scores across all heads (a single 4096×4096 FP16 matrix is 32 MiB) and moves over 4 GB through HBM as it writes and re-reads them. FlashAttention instead processes attention in tiles whose working set fits in approximately 129 KB of per-SM SRAM on an A100 GPU. The gain comes from I/O, not arithmetic: the FLOP count is essentially unchanged, while HBM traffic falls by about 33× at 4K tokens and 129× at 16K tokens. The total attention memory footprint also drops by approximately 9× at 4K tokens.
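The score-matrix figure above can be checked with a few lines of arithmetic. This is a minimal sketch, not the article's own calculation: the head count of 32 and batch size of 1 are assumptions chosen because they reproduce the quoted 1.0 GB, and the four-pass traffic model (write scores, read scores, write probabilities, read probabilities) is a common simplification rather than a measured value.

```python
GIB = 1024 ** 3

def score_matrix_bytes(seq_len, num_heads=32, batch=1, bytes_per_elem=2):
    """Bytes needed to store every N x N score matrix in FP16.

    num_heads=32 and batch=1 are assumptions, not values from the article.
    """
    return batch * num_heads * seq_len * seq_len * bytes_per_elem

def naive_score_traffic_bytes(seq_len, passes=4, **kwargs):
    """Rough HBM traffic for the score matrix alone in standard attention.

    Assumes four full passes: write S, read S, write softmax(S), read softmax(S).
    Traffic for Q, K, V and the output is ignored here.
    """
    return passes * score_matrix_bytes(seq_len, **kwargs)

for n in (4096, 16384):
    stored = score_matrix_bytes(n) / GIB
    moved = naive_score_traffic_bytes(n) / GIB
    print(f"N={n}: scores stored = {stored:.1f} GiB, score traffic = {moved:.1f} GiB")
```

Because the matrix grows with the square of the sequence length, quadrupling the length from 4K to 16K multiplies both numbers by 16 under these assumptions.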

Key takeaway

Machine Learning Engineers optimizing large language models with long sequences should consider integrating FlashAttention. The kernel-level rewrite cuts HBM traffic by about 33× at 4K tokens and lowers total attention memory by about 9×. Because attention on GPUs such as the A100 is limited by memory bandwidth and capacity more than by compute, this can noticeably improve training and inference efficiency without increasing FLOPs.

Key insights

FlashAttention optimizes self-attention by avoiding full matrix materialization, fitting computations in SRAM to reduce HBM traffic.
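To make "fitting computations in SRAM" concrete, the sketch below estimates the on-chip working set of one tile step. The tile shapes (128 query rows, 128 key rows, head dimension 128) are hypothetical: they are chosen only because they land near the 129 KB figure quoted in the summary, and the original article's exact tile sizes are not given here. The estimate also ignores the temporary score tile and any double buffering a real kernel might use.

```python
def tile_working_set_bytes(block_q, block_k, head_dim, elem_bytes=2, stat_bytes=4):
    """Approximate SRAM bytes held while processing one (query tile, key tile) pair.

    All tile sizes passed in are illustrative assumptions, not values from the article.
    """
    q_tile = block_q * head_dim * elem_bytes   # FP16 query tile
    k_tile = block_k * head_dim * elem_bytes   # FP16 key tile
    v_tile = block_k * head_dim * elem_bytes   # FP16 value tile
    o_tile = block_q * head_dim * elem_bytes   # FP16 output accumulator tile
    stats = 2 * block_q * stat_bytes           # FP32 running max and running sum per row
    return q_tile + k_tile + v_tile + o_tile + stats

working_set = tile_working_set_bytes(block_q=128, block_k=128, head_dim=128)
print(f"working set = {working_set / 1024:.1f} KiB")
```

The point of the exercise is that the working set depends on the tile shape and head dimension, not on the sequence length, which is why the same budget holds at 4K and 16K tokens.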

Principles

Method

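FlashAttention processes attention in tiles of queries, keys, and values that fit within SRAM, so the full attention score matrix is never written to HBM. A running row maximum and normalizer are carried across tiles, which lets the softmax be computed incrementally while still producing the exact result.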
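The tiling idea can be sketched in plain Python. This is an illustrative, dependency-free version and not the CUDA kernel: it handles a single head, the forward pass only, with no masking or dropout, and the function names and block sizes are made up for the example. It keeps a running maximum and running sum per query row so that only one tile of scores exists at a time, then compares the result against a reference that builds every full score row.

```python
import math
import random

def naive_attention(Q, K, V):
    """Reference softmax(Q K^T / sqrt(d)) V that computes every full score row."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block_q=4, block_k=4):
    """Same result, computed tile by tile with an online softmax."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for qs in range(0, len(Q), block_q):
        q_tile = Q[qs:qs + block_q]
        row_max = [-math.inf] * len(q_tile)   # running max per query row
        row_sum = [0.0] * len(q_tile)         # running softmax normalizer
        acc = [[0.0] * dv for _ in q_tile]    # unnormalized output accumulator
        for ks in range(0, len(K), block_k):
            k_tile = K[ks:ks + block_k]
            v_tile = V[ks:ks + block_k]
            for i, q in enumerate(q_tile):
                scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
                new_max = max(row_max[i], max(scores))
                # Rescale earlier partial results to the new running maximum.
                correction = math.exp(row_max[i] - new_max)
                weights = [math.exp(s - new_max) for s in scores]
                row_sum[i] = row_sum[i] * correction + sum(weights)
                for j in range(dv):
                    acc[i][j] = acc[i][j] * correction + sum(
                        w * v[j] for w, v in zip(weights, v_tile))
                row_max[i] = new_max
        out.extend([[a / row_sum[i] for a in acc[i]] for i in range(len(q_tile))])
    return out

def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
Q, K, V = (random_matrix(rng, 10, 8) for _ in range(3))
reference = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V)
max_diff = max(abs(a - b) for r, t in zip(reference, tiled) for a, b in zip(r, t))
print("max abs difference:", max_diff)
```

The two functions should agree up to floating-point rounding, which is the sense in which the tiled version is exact and not an approximation. In the real kernel, the inner loops run on tiles held in SRAM and only the final output rows are written back to HBM.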

Best for: AI Engineer, NLP Engineer, Research Scientist, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.