Run DeepSeek V4 on Intel® CPUs and GPUs

2026-05-18 · Source: Artificial Intelligence (AI) articles · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Intermediate, short

Summary

The recently released DeepSeek V4 introduces several frontier architectural optimizations, including a hybrid attention mechanism (Compressed Sparse Attention and Heavily Compressed Attention) that reduces KV cache usage by up to 90%. Its architecture also implements manifold-constrained Hyper-Connections (mHC) for enhanced expressiveness and training stability, and a massive Mixture-of-Experts (MoE) architecture natively trained with the MXFP4 data format, enabling advanced capabilities with a minimal computational footprint. This blog post details the steps for running DeepSeek V4 on Intel® Xeon® CPUs and Intel® Arc™ GPUs using SGLang, providing Docker-based setup and command-line instructions for both platforms to launch an OpenAI-compatible server and query models like DeepSeek-V4-Pro and DeepSeek-V4-Flash.

Key takeaway

For MLOps Engineers deploying DeepSeek V4, this guide confirms that Intel Xeon CPUs and Arc GPUs are now viable platforms. You can utilize SGLang's tailored kernels and Docker setup to achieve efficient inference, reducing KV cache usage by up to 90% and benefiting from MXFP4 MoE. Consider integrating these Intel-optimized solutions to expand your hardware options for DeepSeek V4 deployments.

Key insights

DeepSeek V4 leverages hybrid attention, mHC, and MXFP4 MoE for efficiency and expressiveness, now runnable on Intel CPUs/GPUs via SGLang.

Principles

Hybrid attention reduces KV cache usage by up to 90%.
mHC improves model expressiveness and training stability.
MXFP4-trained MoE enhances computational efficiency.

Method

The article details a Docker-based setup for SGLang, followed by launching an OpenAI-compatible server for DeepSeek V4 models on Intel Xeon CPUs or Arc GPUs, then querying via curl.

In practice

Use SGLang's Dockerfiles for Intel CPU/GPU environment setup.
Launch an OpenAI-compatible server with `sglang serve`.
Query DeepSeek V4 models via standard API calls.

Topics

DeepSeek V4
Intel Xeon CPUs
Intel Arc GPUs
SGLang
Mixture-of-Experts
LLM Inference
Sparse Attention

Code references

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence (AI) articles.