DeepSeek-V4: a million-token context that agents can actually use

2026-04-24 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, medium

Summary

DeepSeek has released V4 as two Mixture-of-Experts (MoE) checkpoints: DeepSeek-V4-Pro (1.6T total parameters, 49B active) and DeepSeek-V4-Flash (284B total, 13B active), both with a 1M-token context window. The models prioritize efficient long-context support for agentic tasks and target common failures such as context budget overruns and KV cache exhaustion. Compared with DeepSeek-V3.2, V4-Pro requires 27% of the single-token inference FLOPs and 10% of the KV cache memory, and V4-Flash requires even less. The efficiency comes from a hybrid attention architecture that interleaves Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) across layers and uses FP8/FP4 storage. V4 also preserves reasoning across user message boundaries in tool-use conversations, introduces an XML-based tool-call schema with a "|DSML|" token, and was trained using DSec, a Rust-based sandbox for RL rollouts.
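The figures in the summary can be turned into quick back-of-the-envelope arithmetic. The sketch below computes the share of parameters active per token for each checkpoint and scales a baseline KV cache size by the quoted 10% ratio for V4-Pro. The function names and the 100 GiB baseline are hypothetical and chosen only for illustration; they are not measurements from the original article.

```python
# Parameter counts quoted in the summary.
MODELS = {
    "DeepSeek-V4-Pro": {"total_params": 1.6e12, "active_params": 49e9},
    "DeepSeek-V4-Flash": {"total_params": 284e9, "active_params": 13e9},
}

# V4-Pro cost relative to DeepSeek-V3.2, as quoted in the summary.
V4_PRO_FLOPS_RATIO = 0.27
V4_PRO_KV_RATIO = 0.10


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of an MoE model's parameters that are active for one token."""
    return active_params / total_params


def scaled_kv_cache_gib(baseline_gib: float, ratio: float = V4_PRO_KV_RATIO) -> float:
    """Scale a baseline KV cache size (GiB) by a relative-cost ratio."""
    return baseline_gib * ratio


if __name__ == "__main__":
    for name, params in MODELS.items():
        frac = active_fraction(params["total_params"], params["active_params"])
        print(f"{name}: {frac:.1%} of parameters active per token")

    # HYPOTHETICAL baseline: assume the older model needed 100 GiB of KV cache
    # for some long-context workload. The number is illustrative only.
    baseline_gib = 100.0
    print(f"KV cache at the quoted ratio: {scaled_kv_cache_gib(baseline_gib):.1f} GiB")
```

Under these assumptions, both checkpoints activate only a few percent of their parameters per token, and the KV cache ratio translates directly into how many concurrent long-context sessions fit in a fixed memory budget.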

Key takeaway

For AI architects and MLOps engineers deploying agentic systems, DeepSeek-V4 targets the long-context problems that break multi-step agent workflows. Its attention architecture and agent-specific training, including persistent reasoning and an XML-based tool-call schema, address those failure modes directly. Consider evaluating DeepSeek-V4-Pro or DeepSeek-V4-Flash for agent applications where context length and inference efficiency are critical.

Key insights

DeepSeek-V4 is optimized for agentic workflows through changes to both architecture and training.

Principles

Efficiency enables utility for long contexts.
Hybrid attention reduces KV cache and FLOPs.
Preserving reasoning improves agent coherence.

Method

DeepSeek-V4 interleaves two attention mechanisms, CSA and HCA, across layers and stores data in FP8/FP4 for efficiency. It preserves reasoning across tool-call turns and uses an XML-based tool-call schema.
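To make "preserving reasoning across tool-call turns" concrete, here is a minimal sketch of the behavioral difference at the conversation-history level. It is not DeepSeek's chat template or API: the `Message` type, its `reasoning` field, and `build_context` are hypothetical names introduced only for illustration. The sketch contrasts a policy that drops earlier reasoning once a new user message arrives with one that keeps it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    role: str                        # "user", "assistant", or "tool"
    content: str
    reasoning: Optional[str] = None  # hypothetical field holding a reasoning trace


def build_context(history: List[Message], preserve_reasoning: bool) -> List[Message]:
    """Return the messages to send to the model on the next turn.

    preserve_reasoning=True keeps every reasoning trace.
    preserve_reasoning=False strips reasoning from assistant messages that
    precede the most recent user message.
    """
    if preserve_reasoning:
        return list(history)

    # Index of the latest user message; -1 if there is none.
    last_user = max(
        (i for i, m in enumerate(history) if m.role == "user"), default=-1
    )
    context = []
    for i, msg in enumerate(history):
        if msg.role == "assistant" and i < last_user:
            context.append(Message(msg.role, msg.content, None))
        else:
            context.append(msg)
    return context


if __name__ == "__main__":
    history = [
        Message("user", "Find the failing test."),
        Message("assistant", "Running the suite.", reasoning="Start with pytest -x."),
        Message("tool", "1 failed: test_parse"),
        Message("user", "Now fix it."),
    ]
    for keep in (False, True):
        ctx = build_context(history, preserve_reasoning=keep)
        print(keep, [m.reasoning for m in ctx])
```

With the stripping policy, the plan formed before the tool call is gone by the time the second user message arrives; with the preserving policy, the agent can still see why it ran the tool.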

In practice

Use DeepSeek-V4 for long-running agentic tasks.
Use the XML-based tool-call schema and its "|DSML|" token for robust tool calls.
Use Think Max mode for complex reasoning.

Topics

DeepSeek-V4
Agentic Workloads
Long-Context Inference
Hybrid Attention
KV Cache Optimization

Best for: AI Architect, MLOps Engineer, CTO, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.