DeepSeek-V4: The Interesting Part Is the Attention Architecture

Source: The Salt - Curated AI · Field: Technology & Digital - Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

DeepSeek-V4 is a new family of Mixture-of-Experts (MoE) models designed for million-token contexts. Its goal is to make long context practical without paying the full computational cost of standard attention. The family includes DeepSeek-V4-Pro, with 1.6 trillion total parameters and 49 billion activated per token, and DeepSeek-V4-Flash, with 284 billion total parameters and 13 billion activated per token. The key architectural changes relative to DeepSeek V3 are hybrid compressed attention and a new residual-stream mechanism called Manifold-Constrained Hyper-Connections (mHC). With this attention design, the attention layers no longer treat the past as a flat list of every token. Instead they store compressed summaries, selectively retrieve the relevant ones, and keep a small exact local window for recent tokens.
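To put the parameter counts above in perspective, the short calculation below works out what fraction of each model is active for a single token. It uses only the four figures quoted in this summary; the dictionary layout and the helper name `active_fraction` are illustrative choices, not anything taken from DeepSeek.

```python
# Parameter counts as quoted in the summary, in billions.
MODELS = {
    "DeepSeek-V4-Pro": {"total_b": 1600, "active_b": 49},
    "DeepSeek-V4-Flash": {"total_b": 284, "active_b": 13},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that are activated for one token."""
    return active_b / total_b


fractions = {name: active_fraction(**p) for name, p in MODELS.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters active per token")
```

By this arithmetic, Pro activates roughly 3% of its parameters per token and Flash roughly 4.6%, which is the sparsity an MoE design relies on to keep per-token compute well below what the total parameter count suggests.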

Key takeaway

For research scientists developing large language models, DeepSeek-V4's approach to long-context processing is a concrete architectural reference. Its hybrid compressed attention and Manifold-Constrained Hyper-Connections (mHC) are worth investigating as ways to reduce the computational overhead of million-token contexts in your own designs, trading off performance against resource efficiency.

Key insights

DeepSeek-V4 uses hybrid compressed attention and mHC to make million-token contexts efficient.

Principles

Method

The model stores compressed summaries of past tokens, selectively retrieves them, and maintains an exact local window for recent tokens.
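The three ingredients named above (compressed summaries, selective retrieval, an exact local window) can be illustrated with a toy single-query attention function. This is a minimal sketch under stated assumptions, not DeepSeek-V4's actual mechanism: mean pooling as the compressor, dot-product top-k as the retrieval rule, and the names `hybrid_attention`, `block`, `top_k`, and `window` are all hypothetical choices made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vec(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def hybrid_attention(query, keys, values, block=4, top_k=2, window=4):
    """Toy attention over top_k compressed blocks plus an exact local window.

    Assumptions (not from the source): blocks are compressed by mean
    pooling, and blocks are selected by query/summary dot product.
    Expects at least one key/value pair.
    """
    n = len(keys)
    # Compress only full blocks; everything after them stays exact, so the
    # local window holds at least `window` tokens when n >= window.
    n_blocks = max(0, n - window) // block
    local_start = n_blocks * block

    # 1. Store compressed summaries of older tokens.
    summ_k = [mean_vec(keys[i * block:(i + 1) * block]) for i in range(n_blocks)]
    summ_v = [mean_vec(values[i * block:(i + 1) * block]) for i in range(n_blocks)]

    # 2. Selectively retrieve the summaries most relevant to the query.
    ranked = sorted(range(n_blocks), key=lambda i: dot(query, summ_k[i]), reverse=True)
    chosen = sorted(ranked[:top_k])

    # 3. Attend over the chosen summaries plus the exact recent tokens.
    att_k = [summ_k[i] for i in chosen] + keys[local_start:]
    att_v = [summ_v[i] for i in chosen] + values[local_start:]
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in att_k])
    out = [sum(w * v[d] for w, v in zip(weights, att_v)) for d in range(len(att_v[0]))]
    return out, chosen, len(att_k)


# Small demonstration: 16 two-dimensional tokens in four distinct groups.
demo_keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4 + [[-1.0, 0.0]] * 4 + [[0.0, 0.0]] * 4
demo_out, demo_chosen, demo_attended = hybrid_attention(
    [1.0, 0.0], demo_keys, demo_keys, block=4, top_k=1, window=4
)
print("blocks retrieved:", demo_chosen, "| entries attended:", demo_attended)
```

In this sketch the final softmax covers `top_k` summaries plus the local window rather than every past token, which is the source of the savings. Scoring the summaries still scales with the number of blocks, so the block size governs how much of the context length is amortized.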

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by The Salt - Curated AI.