Run DeepSeek V4 on Intel® CPUs and GPUs
Summary
The recently released DeepSeek V4 introduces several frontier architectural optimizations, including a hybrid attention mechanism (Compressed Sparse Attention and Heavily Compressed Attention) that reduces KV cache usage by up to 90%. Its architecture also implements manifold-constrained Hyper-Connections (mHC) for enhanced expressiveness and training stability, and a massive Mixture-of-Experts (MoE) architecture natively trained with the MXFP4 data format, enabling advanced capabilities with a minimal computational footprint. This blog post details the steps for running DeepSeek V4 on Intel® Xeon® CPUs and Intel® Arc™ GPUs using SGLang, providing Docker-based setup and command-line instructions for both platforms to launch an OpenAI-compatible server and query models like DeepSeek-V4-Pro and DeepSeek-V4-Flash.
Key takeaway
For MLOps Engineers deploying DeepSeek V4, this guide confirms that Intel Xeon CPUs and Arc GPUs are now viable platforms. You can utilize SGLang's tailored kernels and Docker setup to achieve efficient inference, reducing KV cache usage by up to 90% and benefiting from MXFP4 MoE. Consider integrating these Intel-optimized solutions to expand your hardware options for DeepSeek V4 deployments.
Key insights
DeepSeek V4 leverages hybrid attention, mHC, and MXFP4 MoE for efficiency and expressiveness, now runnable on Intel CPUs/GPUs via SGLang.
Principles
- Hybrid attention reduces KV cache usage by up to 90%.
- mHC improves model expressiveness and training stability.
- MXFP4-trained MoE enhances computational efficiency.
Method
The article details a Docker-based setup for SGLang, followed by launching an OpenAI-compatible server for DeepSeek V4 models on Intel Xeon CPUs or Arc GPUs, then querying via curl.
In practice
- Use SGLang's Dockerfiles for Intel CPU/GPU environment setup.
- Launch an OpenAI-compatible server with `sglang serve`.
- Query DeepSeek V4 models via standard API calls.
Topics
- DeepSeek V4
- Intel Xeon CPUs
- Intel Arc GPUs
- SGLang
- Mixture-of-Experts
- LLM Inference
- Sparse Attention
Code references
Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence (AI) articles.