The Real Cost of Running AI: From FLOPs to GPUs to the KV Cache

Source: LLM on Medium · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Emerging Technologies & Innovation) · Depth: Expert, long

Summary

The article traces the real cost of running AI from mathematical operations (FLOPs) through GPU hardware to memory bottlenecks. It breaks down the per-layer cost of a transformer into QKV projections (6nd² FLOPs), score matrix computation (2n²d FLOPs), and the MLP (16nd² FLOPs), which together with the value aggregation and output projection give a total of 24nd² + 4n²d FLOPs per layer, where n is the sequence length and d is the model dimension. It explains that prefill (prompt processing) is compute-bound and uses GPU cores efficiently, while decode (token generation) is memory-bound and runs an H100 at roughly 1% of its theoretical peak. The KV cache is identified as a critical constraint on GPU concurrency: it grows linearly with sequence length (e.g., 16 KB per token for a 1B model at INT8), which directly raises cost per user. The analysis also compares cloud inference (e.g., Gemini Flash-Lite at $0.30/M) with on-device AI, highlighting the economic advantage of edge deployment for high-usage features, which comes from zero marginal token cost and the absence of concurrency limits.
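The per-layer total can be checked with a few lines of arithmetic. The sketch below is not taken from the article. It assumes the usual convention that one multiply-add counts as 2 FLOPs and an MLP expansion ratio of 4. The summary gives no counts for the output projection and the value aggregation; the values 2nd² and 2n²d used here are assumptions from standard transformer accounting, chosen because they make the terms sum to the stated total. The function name `layer_flops` is hypothetical.

```python
def layer_flops(n: int, d: int, mlp_ratio: int = 4) -> dict:
    """Approximate forward-pass FLOPs for one transformer layer.

    n = sequence length, d = model dimension.
    One multiply-add is counted as 2 FLOPs.
    """
    flops = {
        "qkv_projections": 3 * 2 * n * d * d,      # 6nd^2 (from the summary)
        "attention_scores": 2 * n * n * d,         # QK^T: 2n^2d (from the summary)
        "value_aggregation": 2 * n * n * d,        # weights @ V: 2n^2d (assumed)
        "output_projection": 2 * n * d * d,        # 2nd^2 (assumed)
        "mlp": 2 * 2 * n * d * (mlp_ratio * d),    # 16nd^2 at ratio 4 (from the summary)
    }
    flops["total"] = sum(flops.values())
    return flops


breakdown = layer_flops(n=2048, d=4096)
for name, value in breakdown.items():
    print(f"{name:>18}: {value:.3e}")
```

Because the nd² terms scale linearly in n and the n²d terms quadratically, the attention terms only dominate once n grows well beyond d.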

Key takeaway

AI architects evaluating deployment strategies should understand that long-context features are expensive mainly because they reduce GPU concurrency, not just because computation is slower. Hardware purchasing decisions must distinguish compute-bound prefill from memory-bound decode, and should prioritize memory bandwidth for inference workloads dominated by token generation. For high-usage, always-on features, on-device AI offers better long-term economics: it eliminates per-token costs and concurrency limits, at the price of initial engineering overhead.
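The concurrency effect can be made concrete with a small calculation. This is a minimal sketch, not the article's own method: the model configuration (16 layers, 8 KV heads, head dimension 64) is a hypothetical 1B-class setup picked because it reproduces the 16 KB per token figure at INT8, and the 40 GiB cache budget is an arbitrary assumption.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """KV cache size for one token: keys and values (factor 2) per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_concurrent_sequences(cache_budget_bytes: int, context_tokens: int,
                             per_token_bytes: int) -> int:
    """How many full-length sequences fit in the memory reserved for the cache."""
    return cache_budget_bytes // (context_tokens * per_token_bytes)


# Hypothetical 1B-class configuration at INT8 (1 byte per cached value).
per_token = kv_bytes_per_token(layers=16, kv_heads=8, head_dim=64, bytes_per_value=1)

budget = 40 * 1024**3  # assumed: 40 GiB of GPU memory left for the KV cache
short_ctx = max_concurrent_sequences(budget, 2_048, per_token)
long_ctx = max_concurrent_sequences(budget, 32_768, per_token)

print(per_token // 1024, "KB per token")
print(short_ctx, "concurrent users at 2K context")
print(long_ctx, "concurrent users at 32K context")
```

Under these assumptions, a 16x longer context cuts the number of users one GPU can serve by 16x, so the hardware cost per user rises by the same factor even before any slowdown is counted.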

Key insights

AI inference costs are driven by FLOPs, memory bandwidth, and KV cache size, dictating hardware utilization and concurrency.
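A roofline-style estimate shows why decode leaves most of the GPU idle. The sketch below is an illustration under stated assumptions, not the article's calculation: the peak throughput and memory bandwidth are round placeholder numbers for a datacenter-class GPU rather than vendor specifications, and the model only counts weight reads (it ignores KV cache traffic and kernel overheads). The function names are hypothetical.

```python
# Illustrative round numbers for a datacenter-class GPU; assumptions, not vendor specs.
PEAK_FLOPS = 1.0e15      # FLOP/s
MEM_BANDWIDTH = 3.0e12   # bytes/s


def attainable_flops(intensity: float, peak: float = PEAK_FLOPS,
                     bandwidth: float = MEM_BANDWIDTH) -> float:
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak, intensity * bandwidth)


def weight_bound_utilization(tokens_per_weight_read: int,
                             bytes_per_weight: float) -> float:
    """Fraction of peak compute reached when each weight read serves this many tokens.

    Each weight contributes one multiply-add (2 FLOPs) per token processed.
    """
    intensity = 2 * tokens_per_weight_read / bytes_per_weight  # FLOPs per byte
    return attainable_flops(intensity) / PEAK_FLOPS


# Decode: one new token per sequence, batch of 1, 2-byte weights.
decode_util = weight_bound_utilization(tokens_per_weight_read=1, bytes_per_weight=2.0)
# Prefill: a 2048-token prompt reuses every weight 2048 times.
prefill_util = weight_bound_utilization(tokens_per_weight_read=2048, bytes_per_weight=2.0)

print(f"decode utilization:  {decode_util:.2%}")
print(f"prefill utilization: {prefill_util:.2%}")
```

With these placeholder figures, single-sequence decode lands well under 1% of peak while prefill saturates the compute units, which is the same order of magnitude as the roughly 1% figure quoted in the summary.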

Method

Self-attention is built from QKV projections, dot products that score relevance (QK^T), softmax normalization, and value aggregation. It is then extended to multi-head attention and followed by an MLP.
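The steps named above can be sketched for a single attention head. This is a minimal illustration in dependency-free Python, not the article's code: the helper names are hypothetical, the division by the square root of the key dimension is assumed from the standard scaled dot-product formulation, and causal masking, multiple heads, and the MLP are left out. A real implementation would use an array library.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: returns (output, attention weights)."""
    # 1. QKV projections
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    # 2. Relevance scores QK^T, scaled by sqrt(d_k) (assumed standard scaling)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    # 3. Softmax normalization, one row per query token
    weights = [softmax(row) for row in scores]
    # 4. Value aggregation
    return matmul(weights, v), weights


# Three tokens with model dimension 2; identity projections keep the example readable.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, attn = self_attention(x, identity, identity, identity)
print(attn)
print(out)
```

The n x n `scores` matrix is the source of the n²d terms in the FLOP count, while the three projections account for the nd² terms.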

Best for: AI Engineer, MLOps Engineer, NLP Engineer, Machine Learning Engineer, AI Architect, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.