Serving DeepSeek-V4: why million-token context is an inference systems problem

Source: Together AI | The AI Native Cloud - Together.ai · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure) · Depth: Advanced, long

Summary

DeepSeek-V4 introduces a 1M-token context window through a hybrid attention architecture, which turns long-context capability into an inference systems challenge. The design compresses context before Key-Value (KV) storage and mixes three paths: Compressed Sparse Attention (CSA) (stride 4, 8-token neighborhoods), Heavily Compressed Attention (HCA) (stride 128, about 8K entries for 1M tokens), and Sliding Window Attention (SWA) (128 tokens). Compressing before storage reduces KV cache pressure, which improves concurrency and throughput. Realizing these gains, however, requires the inference engine to manage several distinct cache layouts, more complex prefix reuse policies, and workload-specific endpoint profiles. Together's early work on NVIDIA HGX B200 increased KV-cache capacity from 1.2M to 3.7M tokens by optimizing SWA state, which shows how much serving policy matters.
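The stride and window figures in the summary translate directly into how many positions each attention path keeps. The following is a minimal sketch of that arithmetic only; the `AttentionPath` class and function names are hypothetical, "1M" is assumed to mean 2**20 tokens, and per-entry byte sizes and layer counts are deliberately left out because the summary does not state them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttentionPath:
    """One attention path, described only by the numbers quoted in the summary."""
    name: str
    stride: Optional[int] = None  # tokens folded into one cached entry
    window: Optional[int] = None  # fixed local window, in tokens


# Strides and window taken from the summary; everything else is illustrative.
PATHS = (
    AttentionPath("CSA", stride=4),
    AttentionPath("HCA", stride=128),
    AttentionPath("SWA", window=128),
)


def cached_entries(path: AttentionPath, context_tokens: int) -> int:
    """Count the positions a path keeps for a context of the given length."""
    if path.window is not None:
        # A sliding window never holds more than `window` tokens.
        return min(context_tokens, path.window)
    # Ceiling division: one entry per `stride` tokens, rounding up.
    return -(-context_tokens // path.stride)


def entries_by_path(context_tokens: int) -> dict:
    return {p.name: cached_entries(p, context_tokens) for p in PATHS}


if __name__ == "__main__":
    print(entries_by_path(2**20))
```

Under these assumptions the HCA path holds 8,192 entries at full context, which matches the "8K entries for 1M tokens" figure, while the SWA path stays fixed at 128 regardless of context length.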

Key takeaway

MLOps engineers deploying DeepSeek-V4 should treat its 1M-token context efficiency as a system property, not an out-of-the-box feature. Realizing it means actively managing diverse KV cache layouts, tuning prefix reuse policies (e.g., SWA recompute), and configuring workload-specific endpoint profiles. Benchmark your own context-length regime and traffic shapes to unlock V4's architectural savings; that is what delivers lower latency and higher concurrency for long-context applications such as coding agents.
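One way to start on the benchmarking advice is to summarize what share of requests falls into each context-length regime before choosing an endpoint profile. The sketch below is illustrative only: the regime boundaries and names are hypothetical placeholders, not values from the original article, and should be replaced with boundaries that match your own traffic.

```python
from collections import Counter

# Hypothetical regime boundaries (upper bound in prompt tokens, label).
REGIMES = [
    (8_192, "short"),
    (131_072, "medium"),
    (1_048_576, "long"),
]


def regime(prompt_tokens: int) -> str:
    """Return the label of the first regime whose upper bound covers the prompt."""
    for upper, name in REGIMES:
        if prompt_tokens <= upper:
            return name
    return "over_limit"


def traffic_shape(prompt_lengths: list) -> dict:
    """Fraction of requests per regime; an empty log yields an empty dict."""
    counts = Counter(regime(n) for n in prompt_lengths)
    total = len(prompt_lengths)
    return {name: count / total for name, count in counts.items()}


if __name__ == "__main__":
    # Example request log of prompt lengths, invented for illustration.
    print(traffic_shape([1_000, 2_000, 200_000, 900_000]))
```

A distribution dominated by one regime suggests benchmarking a profile tuned for that regime first, rather than relying on a single general-purpose configuration.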

Key insights

DeepSeek-V4's 1M-token context is an inference systems problem, demanding specialized cache management for efficiency.

Method

DeepSeek-V4 uses a hybrid attention design (CSA, HCA, SWA) that compresses context before KV storage and mixes compressed and local attention paths. Serving it in turn requires adapting prefix reuse strategies to the resulting cache layouts.
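To make the prefix reuse trade-off concrete, the sketch below does the token bookkeeping for a single request that hits a cached prefix. It is an assumption-laden illustration, not Together's or DeepSeek's implementation: the function name, the `swa_state_stored` flag, and the `replay_tokens` parameter are all hypothetical, and defaulting the replay length to one 128-token window is a simplification, since a real engine may need to replay more tokens to rebuild local attention state across layers.

```python
SWA_WINDOW = 128  # sliding-window size quoted in the summary


def plan_prefix_reuse(prompt_tokens: int,
                      cached_prefix_tokens: int,
                      swa_state_stored: bool,
                      replay_tokens: int = SWA_WINDOW) -> dict:
    """Split a prompt into reused, replayed, and newly prefilled tokens.

    Assumes compressed entries for the cached prefix can be reused as-is, and
    that the local SWA state is either stored with the prefix or rebuilt by
    replaying the tail of the prefix (a hypothetical "SWA recompute" policy).
    """
    hit = min(cached_prefix_tokens, prompt_tokens)
    # Nothing to replay if the SWA state was stored or there was no cache hit.
    replay = 0 if swa_state_stored else min(replay_tokens, hit)
    return {
        "reused_tokens": hit,
        "swa_recompute_tokens": replay,
        "new_prefill_tokens": prompt_tokens - hit,
    }


if __name__ == "__main__":
    print(plan_prefix_reuse(100_000, 90_000, swa_state_stored=False))
```

The design choice this illustrates is a trade of memory for compute: not storing SWA state with every cached prefix frees cache space, at the cost of replaying a short tail on each hit.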

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Together AI | The AI Native Cloud - Together.ai.