[AINews] DeepSeek V4 Pro (1.6T-A49B) and Flash (284B-A13B), Base and Instruct — runnable on Huawei Ascend chips

· Source: Latent.Space - Www.latent.space · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

DeepSeek has released DeepSeek-V4 (DSV4), a new family of large language models comprising DeepSeek-V4 Pro and DeepSeek-V4 Flash, its first major architecture refresh since December 2024. V4 Pro has 1.6 trillion total parameters (49 billion active) and V4 Flash has 284 billion total parameters (13 billion active). Both models support a 1-million-token context window, achieved through two new attention techniques, Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), which reduce FLOPs by 73% and KV cache memory by 90% compared to DeepSeek-V3.2. The models were trained on 32-33 trillion tokens and use FP4/FP8 mixed precision. Independent benchmarks place V4 Pro as the #2 open-weight reasoning model, behind Kimi K2.6, with strong performance in long-context and agentic coding tasks. DeepSeek also released DeepEP V2 and TileKernels for optimization and parallelization. The models are MIT-licensed, with competitive API pricing.
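To put the headline figures in proportion, the short sketch below works out what share of each model's parameters is active per token and what the quoted reductions leave relative to DeepSeek-V3.2. It uses only the numbers stated in the summary; the function and variable names are illustrative, not taken from any DeepSeek code.

```python
# Parameter counts as quoted in the summary: (total, active), in billions.
MODELS = {
    "DeepSeek-V4 Pro": (1600, 49),
    "DeepSeek-V4 Flash": (284, 13),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that are active for a given token."""
    return active_b / total_b


def remaining_after_reduction(reduction: float) -> float:
    """Fraction of the baseline cost left after a fractional reduction."""
    return 1.0 - reduction


for name, (total, active) in MODELS.items():
    share = active_fraction(total, active)
    print(f"{name}: {share:.1%} of parameters active per token")

# Reductions quoted in the summary, relative to DeepSeek-V3.2.
print(f"FLOPs relative to V3.2: {remaining_after_reduction(0.73):.0%}")
print(f"KV cache relative to V3.2: {remaining_after_reduction(0.90):.0%}")
```

On these figures, both models activate only a few percent of their weights per token, and the quoted savings leave roughly a quarter of the FLOPs and a tenth of the KV cache of the V3.2 baseline.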

Key takeaway

For AI Architects evaluating open-weight models for long-context or agentic applications, DeepSeek V4 Pro and Flash offer compelling performance and efficiency. Teams should evaluate V4's novel attention mechanisms and FP4/FP8 mixed precision against their own workloads, especially given the 1M-token context and permissive MIT license. Be mindful of the high token usage reported in some evaluations: it can raise the total cost of a task even when per-token pricing is low.
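The cost caveat is easy to check with arithmetic. The sketch below compares two hypothetical models on the same task; every token count and price is invented for illustration and is not a published DeepSeek or competitor figure.

```python
def task_cost_usd(input_tokens: int, output_tokens: int,
                  in_price_per_m: float, out_price_per_m: float) -> float:
    """Total cost of one task, given token counts and USD prices per million tokens."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000


# Hypothetical model A: low per-token price, but verbose (many output tokens).
verbose_cheap = task_cost_usd(50_000, 120_000, in_price_per_m=0.50, out_price_per_m=2.00)

# Hypothetical model B: double the per-token price, but far fewer output tokens.
terse_pricey = task_cost_usd(50_000, 20_000, in_price_per_m=1.00, out_price_per_m=4.00)

print(f"verbose, cheap per token: ${verbose_cheap:.3f} per task")
print(f"terse, pricier per token: ${terse_pricey:.3f} per task")
```

With these made-up inputs, the model that is cheaper per token ends up costing about twice as much per task, which is why cost per completed task is a more useful comparison than list price per token.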

Key insights

DeepSeek V4 advances open-weight long-context modeling and agentic coding through novel attention mechanisms and an efficient architecture.

Method

DeepSeek V4 combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), which rely on shared KV vectors, compressed KV streams, and top-k sparse attention, to reach a 1M-token context with a reduced memory footprint.
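The general idea behind these ingredients can be illustrated with a toy, pure-Python sketch. This is not DeepSeek's CSA or HCA implementation: the mean-pooling compressor, the block size, and the names compress_kv and topk_sparse_attention are all assumptions made for illustration. It only shows how a shared key/value vector, a compressed KV stream, and top-k selection shrink both what is cached and what each query attends to.

```python
import math
import random


def compress_kv(kv, block):
    """Mean-pool each run of `block` consecutive KV vectors into one entry.

    Assumed compression scheme, chosen for simplicity; a ragged tail is dropped.
    """
    n_blocks = len(kv) // block
    compressed = []
    for b in range(n_blocks):
        chunk = kv[b * block:(b + 1) * block]
        compressed.append([sum(col) / block for col in zip(*chunk)])
    return compressed


def topk_sparse_attention(q, kv_c, k):
    """Attend from one query to the k highest-scoring compressed entries.

    Each compressed vector serves as both key and value (shared KV).
    """
    d = len(q)
    scores = [sum(qi * ci for qi, ci in zip(q, c)) / math.sqrt(d) for c in kv_c]
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    m = max(scores[i] for i in top)  # subtract the max for a stable softmax
    weights = [math.exp(scores[i] - m) for i in top]
    z = sum(weights)
    return [sum(w * kv_c[i][j] for w, i in zip(weights, top)) / z for j in range(d)]


random.seed(0)
seq_len, dim, block, k = 1024, 16, 16, 8
kv = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
q = [random.gauss(0, 1) for _ in range(dim)]

kv_c = compress_kv(kv, block)            # compressed stream: seq_len // block entries
out = topk_sparse_attention(q, kv_c, k)  # softmax over only k of those entries
print(len(kv), "->", len(kv_c), "cached entries;", k, "attended per query")
```

In this toy setup, the cache shrinks by the block size and each query's softmax runs over k entries instead of the full sequence. The actual models may compress and select quite differently.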

Best for: AI Architect, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Latent.Space - Www.latent.space.