Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

Sandwich is a hardware-centric, CPU-based LLM serving engine that uses separate execution plans for the prefill and decode phases and optimizes each one independently. Unlike existing solutions that rely on static model partitions and vendor libraries, it explores core utilization and model partitioning for each phase through a tree-based hardware abstraction called TopoTree. A "fast-start-then-finetune" approach to dynamic-shape tensor program generation significantly reduces kernel tuning costs. Evaluated on five CPU platforms, including x86 with AVX-2 and AVX-512 and ARM with NEON, Sandwich achieves an average 2.01x throughput improvement. In single-sequence serving it reaches 90% satisfactory time-to-first-token (TTFT) and time-per-output-token (TPOT) latencies with up to 3.40x lower requirements, and in continuous-batching serving it substantially improves Goodput.

Key takeaway

MLOps engineers optimizing LLM inference on CPU clusters should consider a phase-aware serving architecture like Sandwich. Adapting core utilization and tensor program generation separately for prefill (compute-intensive) and decode (memory-intensive) can significantly boost throughput and reduce latency, potentially allowing you to meet stricter Service Level Objectives (SLOs) without costly GPU investments. Evaluate your current CPU utilization and memory hierarchy to identify opportunities for similar phase-specific optimizations.

Key insights

Optimizing CPU LLM serving requires separate execution plans for prefill and decode phases due to their distinct computational and memory demands.
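The gap between the two phases can be illustrated with a back-of-the-envelope roofline estimate. The sketch below is not taken from the paper: it models a single square weight matrix multiply, and the model width, element size, peak FLOP rate, and memory bandwidth are all hypothetical values chosen for illustration.

```python
def arithmetic_intensity(tokens: int, d_model: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for one (tokens x d_model) @ (d_model x d_model) matmul."""
    flops = 2 * tokens * d_model * d_model               # one multiply and one add per MAC
    weight_bytes = d_model * d_model * bytes_per_elem    # weights are read once
    act_bytes = 2 * tokens * d_model * bytes_per_elem    # input read plus output written
    return flops / (weight_bytes + act_bytes)


def bound_of(tokens: int, d_model: int, peak_flops: float, mem_bw: float) -> str:
    """Classify the matmul as compute- or memory-bound on a simple roofline model."""
    machine_balance = peak_flops / mem_bw                # FLOPs the CPU can do per byte fetched
    if arithmetic_intensity(tokens, d_model) >= machine_balance:
        return "compute"
    return "memory"


# Hypothetical CPU: 2 TFLOP/s peak, 200 GB/s memory bandwidth (assumed numbers).
PEAK_FLOPS, MEM_BW = 2e12, 2e11

prefill = bound_of(tokens=512, d_model=4096, peak_flops=PEAK_FLOPS, mem_bw=MEM_BW)
decode = bound_of(tokens=1, d_model=4096, peak_flops=PEAK_FLOPS, mem_bw=MEM_BW)
print(prefill, decode)
```

Under these assumptions a 512-token prefill performs hundreds of FLOPs per byte, while single-token decode performs roughly one, which is why one core and partition layout is unlikely to suit both phases.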

Method

Sandwich uses TopoTree for hardware abstraction, applying group and remove transformations to explore service configurations. It generates dynamic-shape tensor programs via a fast-start-then-finetune strategy, combined with a micro-kernel sliding window and tensor schedule reuse.
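To make the idea of transforming a hardware tree concrete, here is a toy sketch. This summary does not give TopoTree's actual data structure or the precise semantics of its group and remove transformations, so everything below is an assumption: the `Node` class, the `cores`, `remove`, and `group` helpers, and the two-NUMA-node machine are hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical topology node: a core (leaf) or a container such as a NUMA node."""
    name: str
    children: Tuple["Node", ...] = ()
    is_core: bool = False


def cores(node: Node) -> List[str]:
    """List the names of all cores under this node."""
    if node.is_core:
        return [node.name]
    return [c for child in node.children for c in cores(child)]


def remove(node: Node, target: str) -> Node:
    """Return a copy of the tree with the subtree named `target` dropped."""
    kept = tuple(remove(c, target) for c in node.children if c.name != target)
    return Node(node.name, kept, node.is_core)


def group(node: Node, parent: str, members: Iterable[str], new_name: str) -> Node:
    """Return a copy where the named children of `parent` sit under a new node."""
    members = set(members)
    if node.name == parent:
        picked = tuple(c for c in node.children if c.name in members)
        rest = tuple(c for c in node.children if c.name not in members)
        return Node(node.name, rest + (Node(new_name, picked),), node.is_core)
    moved = tuple(group(c, parent, members, new_name) for c in node.children)
    return Node(node.name, moved, node.is_core)


def make_numa(name: str, first: int, count: int) -> Node:
    leaves = tuple(Node(f"c{i}", is_core=True) for i in range(first, first + count))
    return Node(name, leaves)


# Assumed machine: two NUMA nodes with four cores each.
machine = Node("socket", (make_numa("numa0", 0, 4), make_numa("numa1", 4, 4)))

# One candidate plan: drop a core per NUMA node, e.g. to ease bandwidth contention.
decode_plan = remove(remove(machine, "c3"), "c7")

# Another candidate: gather two cores of numa0 into their own partition.
grouped_plan = group(machine, "numa0", ["c0", "c1"], "g0")

print(cores(decode_plan))
```

Because each transformation returns a new tree, a search procedure could apply them repeatedly to enumerate candidate configurations for each phase and keep the best-performing one. That search loop is omitted here.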

Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.