DeepSeek V4

2026-04-24 · Source: AINews · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, extended

Summary

DeepSeek has released DeepSeek-V4 Pro and DeepSeek-V4 Flash, its first major architecture refresh since V3. Both models offer a 1M-token context window, hybrid reasoning/non-reasoning modes, and an MIT license. DeepSeek-V4 Pro has 1.6T total parameters (49B active) and V4 Flash has 284B total (13B active); both were trained on 32T-33T tokens. A new hybrid attention system, combining shared KV vectors, compressed KV streams, and sparse attention, shrinks the KV cache by a factor of 8.7 compared with V3.2. Independent benchmarks place V4 Pro near the top of open-weight models, comparable to Kimi K2.6 and GLM-5.1, with strong long-context and agentic coding performance. The models use mixed FP4 + FP8 quantization, which allows the full Pro model to fit on an 8x B200 node. API pricing per 1M input/output tokens is $1.74/$3.48 for V4 Pro and $0.14/$0.28 for V4 Flash.
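To make the pricing concrete, here is a minimal sketch that turns the per-1M-token prices quoted above into a per-request estimate. The dictionary keys `v4-pro` and `v4-flash` are hypothetical labels for this example, not confirmed API model identifiers, and the calculation ignores any caching discounts or other billing rules the provider may apply.

```python
# USD per 1M tokens, as quoted in the summary above.
# The keys are illustrative labels, not official API model names.
PRICES_PER_MILLION = {
    "v4-pro": {"input": 1.74, "output": 3.48},
    "v4-flash": {"input": 0.14, "output": 0.28},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at the listed prices."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: a full 1M-token prompt with a 4,000-token reply.
    for model in PRICES_PER_MILLION:
        cost = request_cost(model, 1_000_000, 4_000)
        print(f"{model}: ${cost:.4f}")
```

At these prices a long-context request is dominated by input tokens, which is why the gap between Pro and Flash matters most for workloads that fill the context window.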

Key takeaway

For CTOs and VPs of Engineering evaluating open-weight large language models, DeepSeek V4 is a compelling option, particularly for long-context and agentic coding applications. Its 1M-token context and efficient KV-cache architecture, together with an MIT license and competitive API pricing, make it a strong contender against proprietary models. Consider V4 Flash for cost-sensitive projects that need extensive context, and V4 Pro for higher-performance agentic workflows, especially as its pricing is projected to decrease with Huawei Ascend 950 supernode deployment.

Key insights

DeepSeek V4 advances open-weight long-context and agentic coding with a novel architecture and competitive performance.

Principles

Long-context models require significant KV-cache optimization.
Hybrid attention systems can dramatically reduce memory footprint.
Open-weight models can achieve near-frontier performance.

Method

DeepSeek V4 employs a hybrid attention system with shared KV vectors, compressed KV streams, sparse attention on compressed tokens, and a 128-token sliding window to achieve 1M context with reduced KV cache.
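Of the components named above, the 128-token sliding window is the easiest to illustrate in isolation. The following is a minimal sketch, not DeepSeek's implementation: it assumes the window is causal and includes the current token, and it leaves out shared KV vectors, compressed KV streams, and sparse attention entirely. The function names are hypothetical.

```python
from typing import List, Optional


def sliding_window_mask(seq_len: int, window: int = 128) -> List[List[bool]]:
    """Return mask[i][j] == True when query i may attend to key j.

    Assumption: causal attention in which each query sees itself and
    the previous (window - 1) tokens.
    """
    return [
        [i - window < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(seq_len: int, window: Optional[int] = None) -> int:
    """Count (query, key) pairs; window=None means full causal attention."""
    if window is None:
        return seq_len * (seq_len + 1) // 2
    return sum(min(i + 1, window) for i in range(seq_len))


if __name__ == "__main__":
    n = 4096  # small illustrative length, far below the 1M-token context
    print("full causal pairs:", attended_pairs(n))
    print("windowed pairs:   ", attended_pairs(n, window=128))
```

With a fixed window, the number of local keys per query is capped at the window size regardless of context length, so only the most recent tokens need uncompressed KV entries. In the design described above, longer-range information would have to come from the compressed and sparse paths.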

In practice

Use DeepSeek V4 Flash for cost-effective long-context tasks.
Consider mixed FP4/FP8 quantization for efficient model deployment.
Explore agentic workflows with V4 Pro for enhanced coding performance.
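The quantization point can be checked with back-of-the-envelope arithmetic. This sketch estimates weight memory for a model stored partly in FP4 (0.5 bytes per parameter) and partly in FP8 (1 byte per parameter). The FP4/FP8 split is not stated in the summary, so `fp4_fraction` is a free parameter, and the 192 GB per-GPU capacity is an assumption to verify against the hardware specification. The estimate ignores quantization scale factors, KV cache, and activations.

```python
GB = 1e9  # decimal gigabytes


def weight_memory_gb(total_params: float, fp4_fraction: float) -> float:
    """Approximate weight storage for a mixed FP4 + FP8 model.

    fp4_fraction is the share of parameters stored in FP4; the rest
    are assumed to be FP8. This split is an assumption of the sketch.
    """
    if not 0.0 <= fp4_fraction <= 1.0:
        raise ValueError("fp4_fraction must be between 0 and 1")
    bytes_per_param = fp4_fraction * 0.5 + (1.0 - fp4_fraction) * 1.0
    return total_params * bytes_per_param / GB


def fits_on_node(total_params: float, fp4_fraction: float,
                 gpus: int = 8, gb_per_gpu: float = 192.0) -> bool:
    """Check weights against total node memory (assumed 192 GB per GPU)."""
    return weight_memory_gb(total_params, fp4_fraction) <= gpus * gb_per_gpu


if __name__ == "__main__":
    pro_params = 1.6e12  # V4 Pro total parameters, from the summary
    for frac in (0.0, 0.5, 0.9, 1.0):
        gb = weight_memory_gb(pro_params, frac)
        print(f"FP4 share {frac:.0%}: {gb:.0f} GB, fits: {fits_on_node(pro_params, frac)}")
```

Under these assumptions, an all-FP8 copy of the 1.6T-parameter Pro model would exceed the node's memory, while a predominantly FP4 copy leaves headroom for the KV cache. That is consistent with the claim that mixed quantization is what lets the full model fit on a single node.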

Topics

DeepSeek V4
Long-Context AI
Mixture-of-Experts
Model Quantization
Hardware-Model Co-design

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AINews.