DeepSeek V4
Summary
DeepSeek has released DeepSeek-V4 Pro and DeepSeek-V4 Flash, its first major architecture refresh since V3. Both models offer a 1M-token context window, hybrid reasoning/non-reasoning modes, and an MIT license. V4 Pro has 1.6T total parameters (49B active) and V4 Flash has 284B total (13B active); both were trained on 32T to 33T tokens. A new hybrid attention system that combines shared KV vectors, compressed KV streams, and sparse attention shrinks the KV cache by 8.7x relative to V3.2. Independent benchmarks place V4 Pro near the top of open-weight models, comparable to Kimi K2.6 and GLM-5.1, with strong long-context and agentic coding performance. The models use mixed FP4 + FP8 quantization, which lets the full Pro model fit on a single 8x B200 node. API pricing per 1M tokens is $1.74 input / $3.48 output for V4 Pro and $0.14 input / $0.28 output for V4 Flash.
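To make the pricing gap concrete, here is a minimal sketch of a per-request cost estimate using only the per-1M-token prices quoted above. The dictionary keys "v4-pro" and "v4-flash" and the function name estimate_cost are hypothetical labels chosen for this example, not official API model identifiers.

```python
# Prices in USD per 1M tokens, as quoted in the summary above.
# The keys are illustrative labels, not official API model IDs.
PRICES = {
    "v4-pro": {"input": 1.74, "output": 3.48},
    "v4-flash": {"input": 0.14, "output": 0.28},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request at list price."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # A full 1M-token prompt with a 10k-token answer on each model.
    for name in PRICES:
        print(name, round(estimate_cost(name, 1_000_000, 10_000), 4))
```

At these list prices, a request that fills the 1M-token context costs roughly 12 times more on V4 Pro than on V4 Flash. The sketch ignores any caching discounts or tiered pricing, which the summary does not describe.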
Key takeaway
For CTOs and VPs of Engineering evaluating open-weight large language models, DeepSeek V4 is a strong candidate, particularly for long-context and agentic coding applications. Its 1M-token context and efficient KV-cache architecture, together with an MIT license and competitive API pricing, make it a credible alternative to proprietary models. Consider V4 Flash for cost-sensitive projects that need extensive context, and V4 Pro for higher-performance agentic workflows, especially since Pro pricing is projected to fall as Huawei Ascend 950 supernodes are deployed.
Key insights
DeepSeek V4 advances open-weight long-context and agentic coding with a novel architecture and competitive performance.
Principles
- Long-context models require significant KV-cache optimization.
- Hybrid attention systems can dramatically reduce memory footprint.
- Open-weight models can achieve near-frontier performance.
Method
DeepSeek V4 reaches a 1M-token context with a smaller KV cache by combining four mechanisms in a hybrid attention system: shared KV vectors, compressed KV streams, sparse attention over the compressed tokens, and a 128-token sliding window.
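Of the mechanisms listed above, the sliding window is the simplest to illustrate. The sketch below shows which positions a query token could attend to under a causal 128-token window, and why the uncompressed local cache stays bounded as the sequence grows. It is an illustration only: the function names are hypothetical, the convention that the window includes the current token is an assumption, and it does not model the shared KV vectors, compressed KV streams, or sparse attention.

```python
def visible_positions(query_pos: int, window: int = 128) -> range:
    """Positions a query at `query_pos` may attend to under a causal
    sliding window. Assumes the window includes the query's own position."""
    start = max(0, query_pos - window + 1)
    return range(start, query_pos + 1)


def local_kv_entries(seq_len: int, window: int = 128) -> int:
    """Number of uncompressed KV entries the local window needs to keep
    per layer once `seq_len` tokens have been processed."""
    return min(seq_len, window)


if __name__ == "__main__":
    print(list(visible_positions(5)))      # early tokens see the whole prefix
    print(len(visible_positions(1000)))    # later tokens see only the window
    print(local_kv_entries(1_000_000))     # bounded even at 1M tokens
```

Because the local window is capped at a fixed size, any information from earlier in a 1M-token sequence would have to arrive through the other mechanisms, which is consistent with the method pairing the window with compressed streams and sparse attention.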
In practice
- Use DeepSeek V4 Flash for cost-effective long-context tasks.
- Consider mixed FP4/FP8 quantization for efficient model deployment.
- Explore agentic workflows with V4 Pro for enhanced coding performance.
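As a rough back-of-the-envelope check on the mixed FP4/FP8 point, the sketch below estimates weight memory for a 1.6T-parameter model at different FP4/FP8 splits. Several things here are assumptions rather than facts from the article: the fraction of weights stored in FP4 (fp4_fraction), the figure of about 180 GB of usable memory per B200 GPU, and the choice to ignore quantization scale factors, activations, and the KV cache.

```python
def weight_memory_gb(total_params: float, fp4_fraction: float) -> float:
    """Approximate weight memory in GB for a model whose weights are
    split between FP4 (0.5 byte each) and FP8 (1 byte each)."""
    if not 0.0 <= fp4_fraction <= 1.0:
        raise ValueError("fp4_fraction must be between 0 and 1")
    bytes_per_param = 0.5 * fp4_fraction + 1.0 * (1.0 - fp4_fraction)
    return total_params * bytes_per_param / 1e9


# Assumption: about 180 GB usable per GPU; the exact figure depends on the SKU.
ASSUMED_NODE_GB = 8 * 180

if __name__ == "__main__":
    v4_pro_params = 1.6e12  # total parameters, from the summary above
    for fraction in (0.0, 0.5, 0.9, 1.0):  # hypothetical FP4 shares
        gb = weight_memory_gb(v4_pro_params, fraction)
        print(f"FP4 share {fraction:.0%}: {gb:.0f} GB, fits: {gb < ASSUMED_NODE_GB}")
```

Under these assumptions an all-FP8 copy of the weights would exceed the node while a mostly-FP4 mix would leave headroom for the KV cache, which suggests why mixed precision matters for single-node deployment. The actual split used by DeepSeek is not stated in the summary.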
Topics
- DeepSeek V4
- Long-Context AI
- Mixture-of-Experts
- Model Quantization
- Hardware-Model Co-design
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AINews.