Speed-up JAX LLM Training on Intel® Xeon® 6 CPU: Activation Offloading on Heterogeneous Systems

Source: Artificial Intelligence (AI) articles · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Advanced, short

Summary

JAX-based activation offloading on systems built around Intel® Xeon® 6 processors with P-cores is a memory management strategy for large-scale language model training on heterogeneous CPU–GPU systems. The method uses the CPU's large DDR5 host memory as a live activation store: intermediate activations are transferred from GPU HBM to CPU DRAM during the forward pass and retrieved during backpropagation. Because XLA overlaps these transfers asynchronously with computation, the approach avoids throughput loss by spreading transfer latency across concurrent compute operations. The technique was evaluated on PaliGemma2-28B, Llama 3.3-70B, and Gemma4-31B, using dual-socket Intel Xeon 6 systems with 8x NVIDIA H200 or B300 GPUs, and it outperformed full rematerialization in every reported configuration. Per-step training time improved by 3% to 15.5%: 11.6% for PaliGemma2-28B on H200, 6.9% for Llama 3.3-70B on H200, 3% for Llama 3.3-70B on B300, and 15.5% for Gemma4-31B on B300. The article positions the Intel Xeon 6 processor as an active contributor to GPU-driven LLM training that improves GPU utilization and reduces total cost of ownership.

Key takeaway

AI architects and MLOps engineers scaling large language model training should consider using Intel Xeon 6 processors with P-cores as active memory offloading nodes. With JAX-based activation offloading, the reported per-step training time improved by 3% to 15.5% over full rematerialization, which raises GPU utilization and can lower total cost of ownership. A reasonable starting point is to offload the attention activations (Q, K, V), which extends effective memory in heterogeneous CPU-GPU training pipelines.

Key insights

JAX-based activation offloading to Intel Xeon 6 CPU memory improved per-step LLM training time by 3% to 15.5% over full rematerialization on heterogeneous CPU-GPU systems.

Principles

Method

Transfer activations from GPU HBM to CPU DDR5 memory during the forward pass (device-to-host, D2H) and retrieve them during backpropagation (host-to-device, H2D), relying on JAX/XLA's asynchronous dispatch to overlap the transfers with computation.
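
The method can be sketched in JAX with `jax.checkpoint` and an offloading policy. This is a minimal illustration, not the configuration used in the original article: the toy attention block, the tag names `q_proj`, `k_proj`, `v_proj`, the tensor sizes, and the `make_policy` helper are all assumptions made for the example. On a GPU or TPU backend the tagged Q, K, V activations are offloaded to pinned host memory; on other backends the sketch falls back to full rematerialization, the baseline the article compares against.

```python
import jax
import jax.numpy as jnp
from jax.ad_checkpoint import checkpoint_name

# Hypothetical tag names for the activations we want to move to host memory.
OFFLOAD_NAMES = ["q_proj", "k_proj", "v_proj"]


def make_policy():
    """Offload tagged activations on accelerators, otherwise recompute everything."""
    if jax.default_backend() in ("gpu", "tpu"):
        return jax.checkpoint_policies.save_and_offload_only_these_names(
            names_which_can_be_saved=[],
            names_which_can_be_offloaded=OFFLOAD_NAMES,
            offload_src="device",       # accelerator memory (HBM)
            offload_dst="pinned_host",  # CPU DRAM
        )
    # Assumption: host offloading may be unavailable on this backend,
    # so fall back to full rematerialization.
    return jax.checkpoint_policies.nothing_saveable


def attention_block(params, x):
    # Tag Q, K, V so the checkpoint policy can refer to them by name.
    q = checkpoint_name(x @ params["wq"], "q_proj")
    k = checkpoint_name(x @ params["wk"], "k_proj")
    v = checkpoint_name(x @ params["wv"], "v_proj")
    scores = jax.nn.softmax(q @ k.T / jnp.sqrt(q.shape[-1]), axis=-1)
    return x + scores @ v


def loss_fn(params, x, policy=None):
    block = attention_block
    if policy is not None:
        block = jax.checkpoint(attention_block, policy=policy)
    return jnp.mean(block(params, x) ** 2)


def init_params(key, d_model):
    kq, kk, kv = jax.random.split(key, 3)
    scale = d_model ** -0.5
    return {
        "wq": jax.random.normal(kq, (d_model, d_model)) * scale,
        "wk": jax.random.normal(kk, (d_model, d_model)) * scale,
        "wv": jax.random.normal(kv, (d_model, d_model)) * scale,
    }


params = init_params(jax.random.PRNGKey(0), 64)
x = jax.random.normal(jax.random.PRNGKey(1), (16, 64))

# Reference gradients: all activations stay in device memory.
grad_reference = jax.jit(jax.grad(loss_fn))(params, x)

# Gradients with the offload (or rematerialization) policy applied.
policy = make_policy()
grad_offloaded = jax.jit(jax.grad(lambda p, inp: loss_fn(p, inp, policy)))(params, x)
```

The policy changes where the forward-pass activations live between the forward and backward pass, not the result, so `grad_offloaded` should match `grad_reference` up to floating-point tolerance. Whether the transfers actually overlap with compute depends on the XLA backend and the model, so step time should be measured on the target system.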

Best for: AI Engineer, CTO, VP of Engineering/Data, Machine Learning Engineer, AI Architect, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence (AI) articles.