Speed-up JAX LLM Training on Intel® Xeon® 6 CPU: Activation Offloading on Heterogeneous Systems
Summary
JAX-based activation offloading on systems built around Intel® Xeon® 6 processors with P-cores is an effective memory management strategy for large-scale language model training on heterogeneous CPU–GPU systems. The method repurposes the CPU's large DDR5 host memory as a live activation store: intermediate activations are transferred from GPU HBM to CPU DRAM during the forward pass and retrieved during backpropagation. Because XLA overlaps compute and communication asynchronously, the transfer latency is hidden behind concurrent compute operations rather than added to the step time. The technique was evaluated on PaliGemma2-28B, Llama 3.3-70B, and Gemma4-31B, using configurations with dual-socket Intel Xeon 6 processors and 8x NVIDIA H200 or B300 GPUs, and it consistently outperformed full rematerialization. Per-step training time improved by 3% to 15.5%: 11.6% for PaliGemma2-28B on H200, 6.9% for Llama 3.3-70B on H200, 3% for Llama 3.3-70B on B300, and 15.5% for Gemma4-31B on B300. This positions the Intel Xeon 6 processor as an active contributor to GPU-driven LLM training, improving GPU utilization and reducing total cost of ownership.
Key takeaway
AI Architects and MLOps Engineers scaling large language model training should consider using Intel Xeon 6 processors with P-cores as active memory offloading nodes. In the reported evaluations, JAX-based activation offloading improved per-step training time by 3% to 15.5% over full rematerialization, which raises GPU utilization and lowers total cost of ownership. A practical starting point is to offload the attention activations (Q, K, V) to host memory, extending effective memory capacity in heterogeneous CPU-GPU training pipelines.
Key insights
JAX-based activation offloading to Intel Xeon 6 CPU memory improves per-step LLM training time by 3% to 15.5% over full rematerialization on heterogeneous systems.
Principles
- CPU host memory extends GPU HBM capacity.
- Asynchronous data transfer hides latency.
- Target attention Q, K, V tensors for offloading.
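The latency-hiding principle can be read as a simple timing condition. The idealized model below is a sketch only: the symbols (activation size S_act, host link bandwidth B_link, and the step-time terms) are introduced here for illustration and do not come from the original article, and perfect overlap is assumed.

```latex
t_{\text{transfer}} = \frac{S_{\text{act}}}{B_{\text{link}}}, \qquad
t_{\text{step}}^{\text{overlap}} \approx \max\left(t_{\text{compute}},\; t_{\text{transfer}}\right)
\quad \text{versus} \quad
t_{\text{step}}^{\text{serial}} = t_{\text{compute}} + t_{\text{transfer}}
```

Under this model, offloading adds no step time as long as the transfer finishes within the compute it overlaps with, that is, when t_transfer does not exceed t_compute.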
Method
Transfer activations from GPU HBM to CPU DDR5 memory during the forward pass (device-to-host, D2H) and retrieve them during backpropagation (host-to-device, H2D), relying on JAX/XLA's asynchronous dispatch to overlap the transfers with compute.
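The method above can be sketched with JAX's public checkpointing API. This is a minimal illustration, not the original article's implementation: the tag names ("q_proj", "k_proj", "v_proj"), the toy single-head attention block, and the helper functions are hypothetical. It assumes a JAX version that provides `jax.checkpoint_policies.save_and_offload_only_these_names`, and it falls back to keeping the tagged tensors on device when no "pinned_host" memory space is available (for example, on a CPU-only machine).

```python
import jax
import jax.numpy as jnp
from jax.ad_checkpoint import checkpoint_name

# Hypothetical tag names for the attention projections we want to offload.
OFFLOAD_NAMES = ["q_proj", "k_proj", "v_proj"]


def host_offload_supported() -> bool:
    """True if the default device is an accelerator exposing 'pinned_host' memory."""
    dev = jax.devices()[0]
    if dev.platform == "cpu":
        return False  # activations already live in host memory
    try:
        kinds = {m.kind for m in dev.addressable_memories()}
    except Exception:
        return False
    return "pinned_host" in kinds


def make_policy():
    """Offload tagged activations to host memory when possible."""
    if host_offload_supported():
        return jax.checkpoint_policies.save_and_offload_only_these_names(
            names_which_can_be_saved=[],
            names_which_can_be_offloaded=OFFLOAD_NAMES,
            offload_src="device",
            offload_dst="pinned_host",
        )
    # Fallback (assumption for portability): keep tagged tensors on device.
    return jax.checkpoint_policies.save_only_these_names(*OFFLOAD_NAMES)


def attention(params, x):
    """Toy single-head self-attention; x has shape (seq, d_model)."""
    q = checkpoint_name(x @ params["wq"], "q_proj")
    k = checkpoint_name(x @ params["wk"], "k_proj")
    v = checkpoint_name(x @ params["wv"], "v_proj")
    scores = (q @ k.T) / jnp.sqrt(q.shape[-1])
    return jax.nn.softmax(scores, axis=-1) @ v


def loss_fn(params, x, block):
    return jnp.mean(block(params, x) ** 2)


def init_params(key, d_model):
    keys = jax.random.split(key, 3)
    names = ("wq", "wk", "wv")
    return {n: 0.1 * jax.random.normal(k, (d_model, d_model))
            for n, k in zip(names, keys)}


# Wrap the block so only the tagged Q, K, V tensors are kept (or offloaded);
# everything else is recomputed during the backward pass.
remat_attention = jax.checkpoint(attention, policy=make_policy())

params = init_params(jax.random.PRNGKey(0), 16)
x = jax.random.normal(jax.random.PRNGKey(1), (8, 16))

grads_offload = jax.jit(jax.grad(lambda p: loss_fn(p, x, remat_attention)))(params)
grads_ref = jax.jit(jax.grad(lambda p: loss_fn(p, x, attention)))(params)
```

The gradients from the checkpointed block should match the plain version up to floating-point tolerance; only where the saved activations reside between the forward and backward pass changes. Whether the transfers are actually hidden behind compute depends on the XLA backend and hardware, which this sketch does not measure.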
In practice
- Use JAX/XLA's asynchronous execution to overlap compute with host-device transfers.
- Focus offloading on attention Q, K, V tensors.
- Utilize Intel Xeon 6 as an active memory buffer.
Topics
- JAX
- Activation Offloading
- Intel Xeon 6
- LLM Training
- Heterogeneous Computing
- GPU Acceleration
Best for: AI Engineer, CTO, VP of Engineering/Data, Machine Learning Engineer, AI Architect, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence (AI) articles.