How Attention Got So Efficient [GQA/MLA/DSA]

· Source: Jia-Bin Huang · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Advanced, long

Summary

DeepSeek's experimental model, released in late September 2025, features DeepSeek Sparse Attention (DSA), which lowers compute costs and API pricing by 50% while maintaining performance. This efficiency is the latest step in a series of improvements to the attention mechanism, which builds on tokenization, embeddings, and the query-key-value model. Multi-head attention (MHA) runs several attention heads in parallel so the model can capture different kinds of relationships between tokens. During inference, KV caching stores the key and value vectors of previous tokens so they are not recomputed for every new token, but the cache grows with sequence length and becomes memory-intensive. Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) shrink the cache by reducing the number of key/value heads: MQA shares a single key/value head across all query heads, while GQA shares one per group of query heads, balancing memory efficiency against expressive power. Multi-Head Latent Attention (MLA) goes further by compressing each token's keys and values into a low-dimensional latent vector, achieving a reported 57x memory reduction along with a performance improvement. DSA adds a "lightning indexer" that scores tokens using query and key vectors quantized to 8-bit representations, with Hadamard transforms applied to preserve accuracy, so that attention is computed only over the most relevant tokens. The reported result is 2-3x faster processing of long sequences and a 30-40% memory reduction.
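The relationship between MHA, GQA, and MQA described above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the original article: the function name `grouped_query_attention`, the tensor shapes, and the random inputs are all assumptions made for the example. The only thing that changes between the three variants is the number of key/value heads, and therefore how much has to be cached.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    """Causal self-attention with shared KV heads (illustrative sketch).

    q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d).
    n_kv_heads == n_q_heads is MHA, n_kv_heads == 1 is MQA,
    anything in between is GQA.
    """
    n_q_heads, T, d = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group = n_q_heads // n_kv_heads
    # Each KV head serves `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)       # (n_q_heads, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # hide future tokens
    return softmax(scores) @ v                           # (n_q_heads, T, d)

rng = np.random.default_rng(0)
T, d, n_q = 6, 8, 4                     # toy sizes, chosen arbitrarily
q = rng.standard_normal((n_q, T, d))
outputs = {}
for name, n_kv in [("MHA", 4), ("GQA", 2), ("MQA", 1)]:
    k = rng.standard_normal((n_kv, T, d))
    v = rng.standard_normal((n_kv, T, d))
    outputs[name] = grouped_query_attention(q, k, v)
    cached_values = k.size + v.size     # what a KV cache would have to hold
    print(name, outputs[name].shape, cached_values)
```

The output shape is the same in all three cases, while the number of cached values drops in proportion to the number of KV heads (384, 192, and 96 for these toy sizes).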

Key takeaway

For AI Architects and MLOps Engineers optimizing large language model deployment, DeepSeek Sparse Attention (DSA) and Multi-Head Latent Attention (MLA) offer critical advancements. Your teams should investigate integrating these techniques, particularly for long-sequence processing, to achieve substantial reductions in memory footprint and inference costs without sacrificing model performance. Prioritize evaluating GQA or MLA for KV cache optimization and DSA for overall throughput improvements.
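When evaluating these options, a back-of-the-envelope estimate of KV cache size is often a useful first step. The sketch below is an assumption-laden illustration rather than a measurement: the model shape (32 layers, head dimension 128, 32,768-token context, 2 bytes per value) and the latent width of 512 are hypothetical, and the MLA-style estimate ignores any extra per-token positional component that a real implementation may also cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes cached per sequence: keys plus values for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def latent_cache_bytes(n_layers, latent_dim, seq_len, bytes_per_value=2):
    """Bytes cached per sequence when one compressed latent vector is stored
    per token and layer (MLA-style, simplified)."""
    return n_layers * latent_dim * seq_len * bytes_per_value

# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
N_LAYERS, HEAD_DIM, SEQ_LEN = 32, 128, 32_768
GIB = 2**30

sizes = {
    "MHA (32 KV heads)": kv_cache_bytes(N_LAYERS, 32, HEAD_DIM, SEQ_LEN),
    "GQA (8 KV heads)": kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, SEQ_LEN),
    "MQA (1 KV head)": kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN),
    "Latent (dim 512)": latent_cache_bytes(N_LAYERS, 512, SEQ_LEN),
}
for name, size in sizes.items():
    print(f"{name}: {size / GIB:.2f} GiB per sequence")
# MHA (32 KV heads): 16.00 GiB per sequence
# GQA (8 KV heads): 4.00 GiB per sequence
# MQA (1 KV head): 0.50 GiB per sequence
# Latent (dim 512): 1.00 GiB per sequence
```

These figures only illustrate how the cache scales with the number of KV heads or the latent width. The savings reported for any specific model depend on its actual configuration.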

Key insights

Efficient attention mechanisms like DSA significantly reduce LLM compute costs and memory while preserving performance.

Method

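DeepSeek Sparse Attention (DSA) uses a lightning indexer to score how relevant each earlier token is to the current one. The indexer works on query/key vectors quantized to 8 bits, with Hadamard transforms applied to preserve accuracy. Attention is then computed only over the tokens the indexer selects, producing a sparse attention pattern that is cheaper than attending to the full sequence.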
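The selection step can be sketched as follows. This is a simplified illustration under stated assumptions, not DeepSeek's implementation: a single dot product stands in for the indexer's scoring function, the indexer vectors are random rather than learned projections, the quantization is plain symmetric int8, and the names `index_scores` and `sparse_causal_attention` are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_int8(x):
    """Symmetric per-row int8 quantization: returns (int8 values, float scales)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    return np.round(x / scale).astype(np.int8), scale

def index_scores(q_idx, k_idx):
    """Approximate q_idx @ k_idx.T using 8-bit vectors."""
    H = hadamard(q_idx.shape[-1])
    # Rotating both sides by the same orthonormal matrix leaves dot products
    # unchanged but spreads outlier values across dimensions before quantizing.
    qq, q_scale = quantize_int8(q_idx @ H)
    kq, k_scale = quantize_int8(k_idx @ H)
    raw = qq.astype(np.int32) @ kq.astype(np.int32).T
    return raw * q_scale * k_scale.T

def sparse_causal_attention(q, k, v, scores, top_k):
    """Each query attends only to its top_k highest-scoring earlier tokens."""
    T, d = q.shape
    out = np.zeros_like(v)
    for t in range(T):
        visible = scores[t, : t + 1]            # causal: tokens 0..t only
        keep = np.argsort(visible)[-top_k:]     # indices of the best-scoring tokens
        logits = q[t] @ k[keep].T / np.sqrt(d)
        w = np.exp(logits - logits.max())
        out[t] = (w / w.sum()) @ v[keep]
    return out

rng = np.random.default_rng(0)
T, d, d_idx = 32, 16, 16                        # toy sizes, chosen arbitrarily
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
q_idx = rng.standard_normal((T, d_idx))         # stand-ins for learned indexer vectors
k_idx = rng.standard_normal((T, d_idx))

scores = index_scores(q_idx, k_idx)
sparse_out = sparse_causal_attention(q, k, v, scores, top_k=8)
dense_out = sparse_causal_attention(q, k, v, scores, top_k=T)  # keeps every token
print(sparse_out.shape, np.abs(scores - q_idx @ k_idx.T).max())
```

With `top_k` equal to the sequence length the function reduces to ordinary causal attention, which makes the trade-off easy to inspect: the indexer pass is cheap because it runs on small 8-bit vectors, and the expensive attention step touches at most `top_k` tokens per query.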

Best for: AI Architect, MLOps Engineer, NLP Engineer, AI Engineer, Machine Learning Engineer, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Jia-Bin Huang.