As agentic AI pushes rivals to raise prices and cap usage, Deepseek ships a good-enough model for almost nothing

2026-04-24 · Source: The Decoder · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Emerging Technologies & Innovation · Depth: Advanced, short

Summary

Chinese AI lab Deepseek has released V4-Pro and V4-Flash, two open-weight models with up to 1.6 trillion parameters and a one-million-token context window. The models introduce a hybrid attention architecture that cuts compute requirements for long contexts: V4-Pro needs only 27% of the FLOPs and 10% of the KV cache that V3.2 does. This efficiency lets Deepseek price V4-Flash at $0.14 and V4-Pro at $1.74 per million input tokens, substantially undercutting competitors such as OpenAI, Google, and Anthropic. Trained on up to 33 trillion tokens and refined through on-policy distillation from in-house specialist models, the V4 models are optimized for agentic tasks and support both Nvidia GPUs and Huawei Ascend chips.
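To make the efficiency figures concrete, the sketch below applies the reported 27% FLOPs and 10% KV-cache ratios to a V3.2 baseline. The baseline numbers are hypothetical placeholders, chosen only to show the arithmetic, and the function name is an assumption for illustration, not part of any Deepseek tooling.

```python
# Ratios reported in the summary: V4-Pro relative to V3.2 for long contexts.
V4_PRO_FLOPS_RATIO = 0.27
V4_PRO_KV_CACHE_RATIO = 0.10


def scale_from_v32(baseline_flops: float, baseline_kv_cache_gb: float) -> dict:
    """Scale a V3.2 long-context baseline by the reported V4-Pro ratios."""
    return {
        "flops": baseline_flops * V4_PRO_FLOPS_RATIO,
        "kv_cache_gb": baseline_kv_cache_gb * V4_PRO_KV_CACHE_RATIO,
    }


# Hypothetical baseline (not a measured value): 1e15 FLOPs and 80 GB of KV cache.
estimate = scale_from_v32(baseline_flops=1.0e15, baseline_kv_cache_gb=80.0)
print(estimate)
```

With this placeholder baseline, an 80 GB cache would shrink to roughly 8 GB. Real savings depend on context length and deployment details that the summary does not specify.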

Key takeaway

For AI engineers evaluating large language models for agentic applications, Deepseek's V4-Pro and V4-Flash combine competitive performance, a one-million-token context window, and low cost. Teams can cut API spend substantially, especially with V4-Flash at $0.14 per million input tokens, while using the long context for complex tasks. Consider integrating these open-weight models to reduce API expenditure and strengthen agentic capabilities.
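A rough way to size the savings is to multiply expected input volume by the per-million-token prices quoted in the summary. The sketch below does only that. It covers input tokens alone, since output pricing is not given here. The dictionary keys are plain labels rather than API model identifiers, and the workload is a hypothetical example.

```python
# Input prices in USD per million tokens, as quoted in the summary.
INPUT_PRICE_PER_MILLION_USD = {
    "V4-Flash": 0.14,
    "V4-Pro": 1.74,
}


def input_cost_usd(model: str, input_tokens: int) -> float:
    """Estimate input-token cost only; output-token pricing is not covered."""
    return INPUT_PRICE_PER_MILLION_USD[model] * input_tokens / 1_000_000


# Hypothetical workload: 500 agent runs, each sending 200,000 input tokens.
total_input_tokens = 500 * 200_000

for model in INPUT_PRICE_PER_MILLION_USD:
    print(f"{model}: ${input_cost_usd(model, total_input_tokens):.2f}")
# V4-Flash: $14.00
# V4-Pro: $174.00
```

Swapping in your own token counts gives a first-order comparison against whatever you currently pay per million input tokens.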

Key insights

Deepseek's V4 models pair one-million-token context windows with competitive performance at far lower prices, enabled by a hybrid attention architecture that reduces long-context compute.

Principles

Hybrid attention reduces long-context compute
Distillation refines specialist model knowledge
Open-weight models drive cost competition

Method

Deepseek combines token compression with sparse attention in a hybrid attention architecture, then post-trains the models with on-policy distillation from multiple in-house specialist models.
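The summary does not describe how the compression or the sparsity is implemented, so the following is only a toy illustration of the general idea, not Deepseek's architecture. It assumes mean pooling over fixed-size blocks as the compression step and top-k block selection as the sparse step. Both are stand-ins, and all function names are hypothetical.

```python
import math


def mean_pool_blocks(vectors, block_size):
    """Token compression (assumed form): replace each block with its mean."""
    pooled = []
    for start in range(0, len(vectors), block_size):
        block = vectors[start:start + block_size]
        dim = len(block[0])
        pooled.append([sum(v[d] for v in block) / len(block) for d in range(dim)])
    return pooled


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_attention(query, keys, values, block_size, top_k):
    """Attend over the top_k compressed blocks instead of every token."""
    pooled_keys = mean_pool_blocks(keys, block_size)
    pooled_values = mean_pool_blocks(values, block_size)
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in pooled_keys]
    # Sparse step (assumed form): keep only the highest-scoring blocks.
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    weights = softmax([scores[i] for i in keep])
    dim = len(values[0])
    output = [
        sum(w * pooled_values[i][d] for w, i in zip(weights, keep))
        for d in range(dim)
    ]
    return output, sorted(keep)


# Tiny example: 8 tokens compressed into 4 blocks, of which 2 are attended.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
        [-1.0, 0.0], [-1.0, 0.0], [0.5, 0.5], [0.5, 0.5]]
values = [[float(i), 0.0] for i in range(8)]
output, attended_blocks = hybrid_attention(
    [1.0, 0.0], keys, values, block_size=2, top_k=2
)
print(attended_blocks)  # [0, 3]
```

In this toy, the query scores 4 compressed keys rather than 8 token keys and mixes only 2 value vectors, which is how compression and sparsity can each reduce long-context work. The cached keys and values shrink by the block size as well.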

In practice

Integrate V4 models for agentic workflows
Use V4-Flash for cost-sensitive applications
Deploy on Nvidia GPUs or Huawei Ascend NPUs

Topics

Deepseek V4-Pro
Deepseek V4-Flash
Hybrid Attention Architecture
Mixture-of-Experts
Agentic Workflows

Best for: CTO, VP of Engineering/Data, AI Engineer, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by The Decoder.