The Real Cost of Running AI: From FLOPs to GPUs to the KV Cache
Summary
The article analyzes the real cost of running AI, tracing it from mathematical operations (FLOPs) to GPU hardware and memory bottlenecks. It walks through self-attention mechanics, including the QKV projections (6nd² FLOPs), the score matrix computation (2n²d FLOPs), and the MLP (16nd² FLOPs), arriving at a total of 24nd² + 4n²d FLOPs per transformer layer. It explains that prefill (prompt processing) is compute-bound and uses GPU cores efficiently, while decode (token generation) is memory-bound and runs an H100 at roughly 1% of its theoretical peak. The KV cache is identified as a critical constraint on GPU concurrency: it grows linearly with sequence length (e.g., 16 KB per token for a 1B model at INT8), which directly raises cost per user. The analysis also compares cloud inference (e.g., Gemini Flash-Lite at $0.30 per million tokens) with on-device AI, highlighting the economic advantage of edge deployment for high-usage features, where marginal token cost is zero and there is no shared GPU concurrency to manage.
Key takeaway
For AI architects evaluating deployment strategies: long-context features are expensive because they reduce GPU concurrency, not just because computation gets slower. Hardware purchasing decisions should distinguish compute-bound prefill from memory-bound decode, and should prioritize memory bandwidth for inference-heavy workloads. For high-usage, always-on features, on-device AI offers better long-term economics by eliminating per-token costs and concurrency limits, despite the initial engineering overhead.
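The gap between compute-bound prefill and memory-bound decode can be illustrated with a rough roofline estimate. This is a minimal sketch, not the article's calculation: the peak throughput and memory bandwidth constants are assumed round figures for an H100-class GPU, the function name is hypothetical, and KV cache reads are ignored.

```python
# Illustrative roofline estimate for decode.
# Hardware numbers are assumptions for this sketch, not measurements.
PEAK_FLOPS = 989e12      # assumed dense 16-bit peak, FLOP/s
MEM_BANDWIDTH = 3.35e12  # assumed HBM bandwidth, bytes/s


def decode_utilization(bytes_per_param: float, batch_size: int = 1) -> float:
    """Fraction of peak compute reachable when every weight is read once per step."""
    # Each weight contributes one multiply and one add per sequence in the batch.
    flops_per_byte = 2.0 * batch_size / bytes_per_param
    attainable = min(PEAK_FLOPS, flops_per_byte * MEM_BANDWIDTH)
    return attainable / PEAK_FLOPS


if __name__ == "__main__":
    for label, width in [("FP16", 2.0), ("INT8", 1.0)]:
        print(f"{label} weights, batch 1: {decode_utilization(width):.2%} of peak")
```

Under these assumptions a single decode stream stays below 1% of peak compute, which is in line with the "roughly 1%" figure quoted above. Raising `batch_size` raises utilization, which is why concurrency matters so much for cost.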
Key insights
AI inference costs are driven by FLOPs, memory bandwidth, and KV cache size, dictating hardware utilization and concurrency.
Principles
- Transformer cost: 24nd² + 4n²d per layer.
- Prefill is compute-bound; decode is memory-bound.
- KV cache size limits GPU concurrency.
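The per-layer cost formula can be checked with a short calculation. This is a sketch under stated assumptions: the summary names only the QKV, score, and MLP terms, so the value-aggregation and output-projection entries below are filled in from the standard transformer accounting to reach the stated total. The function name is hypothetical.

```python
def layer_flops(n: int, d: int) -> dict:
    """Forward-pass FLOPs for one transformer layer: n tokens, model width d."""
    parts = {
        "qkv_projections": 6 * n * d * d,    # three (n x d) by (d x d) matmuls
        "attention_scores": 2 * n * n * d,   # QK^T
        "value_aggregation": 2 * n * n * d,  # assumed: attention weights times V
        "output_projection": 2 * n * d * d,  # assumed: one more (d x d) matmul
        "mlp": 16 * n * d * d,               # two matmuls with a 4d hidden width
    }
    parts["total"] = sum(parts.values())
    return parts


if __name__ == "__main__":
    for n in (512, 8192, 131072):
        flops = layer_flops(n, d=2048)
        quadratic = flops["attention_scores"] + flops["value_aggregation"]
        print(f"n={n}: total={flops['total']:.3e}, n^2 share={quadratic / flops['total']:.1%}")
```

Setting 4n²d equal to 24nd² shows that the quadratic attention term only overtakes the linear term once n exceeds 6d, so for short prompts the projections and MLP dominate.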
Method
Self-attention is built from QKV projections, dot products for relevance (QK^T), softmax normalization, and value aggregation, followed by multi-head attention and MLPs.
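The steps above can be sketched for a single head in plain Python. This is an illustrative toy, not the article's code: it assumes the usual scaling by the square root of the head dimension, omits the causal mask, multi-head splitting, and the MLP, and uses hypothetical function names.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head attention: x is n x d, each weight matrix is d x d_head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)  # QKV projections
    d_head = len(q[0])
    k_t = [list(col) for col in zip(*k)]                      # transpose K
    scores = [[s / math.sqrt(d_head) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]                # each row sums to 1
    return matmul(weights, v), weights                        # value aggregation


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    out, weights = self_attention(identity, identity, identity, identity)
    print(weights)
```

The `scores` matrix is n x n, which is where the n² term in the cost formula comes from.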
In practice
- Use grouped-query attention (GQA) to cut the KV cache, e.g. by 4x when four query heads share each KV head.
- Optimize for memory bandwidth in decode.
- Consider on-device for high-usage features.
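The effect of GQA on cache size and concurrency can be estimated with a few lines of arithmetic. This is a sketch with a hypothetical model configuration and an assumed 40 GiB of memory left for the cache after weights. The function names are illustrative and do not come from any library.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Cache bytes one token adds: a key and a value vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_concurrent_sequences(free_bytes: int, context_tokens: int, per_token_bytes: int) -> int:
    """How many full-length sequences fit in the memory left for the cache."""
    return free_bytes // (context_tokens * per_token_bytes)


# Hypothetical configuration, chosen only for illustration.
LAYERS, QUERY_HEADS, HEAD_DIM = 16, 32, 64
INT8 = 1  # bytes per cached value

mha = kv_bytes_per_token(LAYERS, QUERY_HEADS, HEAD_DIM, INT8)       # one KV head per query head
gqa = kv_bytes_per_token(LAYERS, QUERY_HEADS // 4, HEAD_DIM, INT8)  # four query heads share a KV head

if __name__ == "__main__":
    free = 40 * 2**30  # assumed memory available for the cache
    for name, per_token in [("MHA", mha), ("GQA", gqa)]:
        fits = max_concurrent_sequences(free, 8192, per_token)
        print(f"{name}: {per_token // 1024} KB per token, {fits} sequences at 8192 tokens")
```

With these assumed values the GQA variant comes to 16 KB per token and fits four times as many 8192-token sequences as the full multi-head cache, which is the concurrency gain the first bullet refers to.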
Topics
- AI Inference Economics
- Transformer Architecture
- KV Cache Optimization
- GPU Performance Bottlenecks
- On-device AI Deployment
- Memory Bandwidth
Best for: AI Engineer, MLOps Engineer, NLP Engineer, Machine Learning Engineer, AI Architect, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.