Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving
Summary
Sandwich is a hardware-centric, CPU-based LLM serving engine that treats the prefill and decode phases separately and optimizes each one independently. Unlike existing solutions that rely on static model partitions and vendor libraries, Sandwich uses a distinct execution plan for each phase and explores core utilization and model partitioning through a tree-based hardware abstraction called TopoTree. It also generates dynamic-shape tensor programs with a "fast-start-then-finetune" approach that significantly reduces kernel tuning cost. Evaluated across five CPU platforms, including x86 with AVX-2 and AVX-512 and ARM with NEON, Sandwich achieves an average 2.01x throughput improvement and 90% satisfactory time-to-first-token (TTFT) and time-per-output-token (TPOT) latencies with up to 3.40x lower requirements in single-sequence serving, along with substantial Goodput improvements under continuous batching.
Key takeaway
For MLOps engineers optimizing LLM inference on CPU clusters, consider adopting a phase-aware serving architecture like Sandwich. Its ability to dynamically adapt core utilization and tensor program generation for prefill (compute-intensive) and decode (memory-intensive) phases can significantly boost throughput and reduce latency, potentially allowing you to meet stricter Service Level Objectives (SLOs) without costly GPU investments. Evaluate your current CPU utilization and memory hierarchy to identify opportunities for similar phase-specific optimizations.
Key insights
Optimizing CPU LLM serving requires separate execution plans for prefill and decode phases due to their distinct computational and memory demands.
Principles
- Prefill is compute-bound, decode is memory-bound.
- Jointly optimize computation slices and polymerization schemes.
- Active core numbers and locations impact performance.
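The first principle can be checked with a rough roofline-style estimate. The sketch below is not taken from the paper: it assumes a single dense linear layer, 2-byte weights and activations, and illustrative sizes (4096 x 4096 weights, a 512-token prompt). It compares FLOPs per byte moved when the layer processes a whole prompt (prefill) versus one token (decode).

```python
def arithmetic_intensity(tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for one dense layer (illustrative model, not from the paper)."""
    # A matmul of shape (tokens x d_in) @ (d_in x d_out) costs 2 * tokens * d_in * d_out FLOPs.
    flops = 2 * tokens * d_in * d_out
    # Bytes moved: the weight matrix once, plus input and output activations.
    bytes_moved = bytes_per_elem * (d_in * d_out + tokens * (d_in + d_out))
    return flops / bytes_moved


# Assumed sizes, chosen only for illustration.
prefill = arithmetic_intensity(tokens=512, d_in=4096, d_out=4096)
decode = arithmetic_intensity(tokens=1, d_in=4096, d_out=4096)

print(f"prefill: {prefill:.1f} FLOPs/byte")
print(f"decode:  {decode:.2f} FLOPs/byte")
```

Under these assumptions, prefill reuses each loaded weight across hundreds of tokens, while decode performs roughly one FLOP per byte of weights read, which is why the two phases tend to sit on opposite sides of a CPU's compute-to-bandwidth ratio.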
Method
Sandwich uses TopoTree as its hardware abstraction and applies group and remove transformations to it to explore service configurations. It generates dynamic-shape tensor programs via a fast-start-then-finetune strategy, combined with a micro-kernel sliding window and tensor schedule reuse.
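To make the idea of tree transformations concrete, here is a minimal sketch of a topology tree with `group` and `remove` operations. It is a hypothetical illustration only: the `Node` class, the `build` helper, and the exact semantics of both operations are assumptions made for this example, and the paper's actual TopoTree definitions may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a toy topology tree (hypothetical, not Sandwich's data structure)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    is_core: bool = False

    def leaves(self):
        """Names of all cores in this subtree, left to right."""
        if self.is_core:
            return [self.name]
        out = []
        for child in self.children:
            out.extend(child.leaves())
        return out


def build(numa_nodes, cores_per_node):
    """Build machine -> NUMA node -> core, with globally numbered cores."""
    return Node("machine", [
        Node(f"numa{n}", [
            Node(f"core{n * cores_per_node + c}", is_core=True)
            for c in range(cores_per_node)
        ])
        for n in range(numa_nodes)
    ])


def remove(node, target):
    """Assumed semantics: return a copy of the tree without the subtree named `target`."""
    kept = [remove(c, target) for c in node.children if c.name != target]
    return Node(node.name, kept, node.is_core)


def group(node, size):
    """Assumed semantics: bundle a node's direct children into groups of `size`.

    The grouped node is new; the child nodes themselves are shared, not copied.
    """
    kids = node.children
    bundles = [
        Node(f"{node.name}/g{i // size}", kids[i:i + size])
        for i in range(0, len(kids), size)
    ]
    return Node(node.name, bundles, node.is_core)


tree = build(numa_nodes=2, cores_per_node=4)
# Deactivate one core per NUMA node, e.g. as a candidate decode configuration.
decode_tree = remove(remove(tree, "core3"), "core7")
# Split the first NUMA node's cores into two groups of two.
grouped = group(tree.children[0], 2)
```

Each transformation returns a new tree, so a search procedure could enumerate candidate configurations without mutating the baseline topology.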
In practice
- Use TopoTree to model CPU memory hierarchy.
- Employ fast-start for initial kernel expansion.
- Limit active CPU cores to reduce memory contention.
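As a small illustration of the last point, the sketch below restricts a process to a subset of cores using Python's `os.sched_setaffinity`, which is available on Linux and some other Unix platforms. The helper names and the "lowest-numbered cores" selection rule are assumptions for this example; Sandwich searches for the number and location of active cores rather than using a fixed rule like this one.

```python
import os


def pick_cores(available, limit):
    """Choose the `limit` lowest-numbered core ids (a naive, topology-unaware rule)."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return set(sorted(available)[:limit])


def restrict_current_process(limit):
    """Pin the current process to at most `limit` cores.

    Returns the chosen core set, or None where the affinity API is unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    chosen = pick_cores(os.sched_getaffinity(0), limit)
    os.sched_setaffinity(0, chosen)
    return chosen
```

Calling `restrict_current_process(8)` before launching decode workers would cap them at eight cores; a topology-aware selection would replace `pick_cores` with a choice that accounts for which cores share a NUMA node or cache.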
Topics
- CPU LLM Serving
- Prefill-Decode Optimization
- Dynamic-Shape Tensor Programs
- NUMA Systems
- TopoTree Abstraction
Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.