Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective
Summary
Agentic AI frameworks, which integrate decision-making orchestrators with external tools such as web search and Python interpreters, introduce significant CPU-centric bottlenecks that performance analysis often overlooks. This study systematically characterizes agentic AI along three axes: orchestrator type (LLM or host), inference path dynamics (static or dynamic), and flow repetitiveness (single-step or multi-step). The researchers profiled five representative agentic AI workloads (Haystack RAG, Toolformer, ChemCrow, LangChain, and SWE-Agent) on a system with an Intel Emerald Rapids CPU and an NVIDIA B200 GPU, and found that CPU tool processing can account for up to 90.6% of total latency. Agentic throughput is bottlenecked either by CPU factors (coherence, synchronization, core over-subscription) or by GPU factors (memory capacity, bandwidth), and CPU dynamic energy can reach up to 44% of total dynamic energy at large batch sizes. To address these issues, the researchers developed two optimizations, CPU and GPU-Aware Micro-batching (CGAM) and Mixed Agentic Workload Scheduling (MAWS), which achieve up to 2.1× and 1.41× P50 latency speedup for homogeneous and heterogeneous workloads, respectively.
Key takeaway
For AI architects and MLOps engineers deploying agentic AI, optimizing CPU-bound tool processing is critical for performance and efficiency. Implement CPU and GPU-Aware Micro-batching (CGAM) with a batching cap of 64 to reduce P50 latency by up to 2.1× and save CPU energy. For heterogeneous workloads, consider Mixed Agentic Workload Scheduling (MAWS) to adaptively use multi-processing for CPU-heavy tasks and multi-threading for LLM-heavy tasks, improving overall P99 latency by 1.17×.
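As a rough illustration of the batching cap mentioned above, the sketch below splits an incoming batch into micro-batches of at most 64 requests. It shows only the capping step, not the CPU- and GPU-aware decisions that CGAM makes, and the names `micro_batches` and `BATCH_CAP` are hypothetical, not taken from the paper.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Cap quoted in the takeaway above; everything else here is an assumption.
BATCH_CAP = 64


def micro_batches(requests: Sequence[T], cap: int = BATCH_CAP) -> Iterator[List[T]]:
    """Yield consecutive slices of `requests`, each holding at most `cap` items."""
    if cap < 1:
        raise ValueError("cap must be at least 1")
    for start in range(0, len(requests), cap):
        # Slicing past the end is safe: the last micro-batch is simply shorter.
        yield list(requests[start:start + cap])
```

With 150 queued requests and the default cap, this yields micro-batches of 64, 64, and 22 requests, in arrival order.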
Key insights
CPU-centric bottlenecks in agentic AI workloads significantly affect latency, throughput, and energy consumption, often more than GPU inference does.
Principles
- Tool processing on CPUs can dominate agentic AI latency.
- Agentic throughput is limited by CPU or GPU resource saturation.
- CPU parallelism is less energy-efficient than GPU parallelism at scale.
Method
The study characterizes agentic AI by orchestrator type, inference path dynamics, and flow repetitiveness, then profiles latency, throughput, and energy across five workloads. The findings motivate the CGAM and MAWS optimizations.
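One way to picture the three-axis characterization and the latency profiling is a small record type, sketched below. The class and field names are hypothetical, and the sketch assumes total latency is simply CPU tool time plus GPU inference time, which the summary does not state. It does not classify any of the five workloads.

```python
from dataclasses import dataclass
from enum import Enum


class Orchestrator(Enum):
    LLM = "llm"
    HOST = "host"


class PathDynamics(Enum):
    STATIC = "static"
    DYNAMIC = "dynamic"


class Repetitiveness(Enum):
    SINGLE_STEP = "single-step"
    MULTI_STEP = "multi-step"


@dataclass(frozen=True)
class WorkloadProfile:
    """One profiled workload: its position on the three axes plus measured times."""

    name: str
    orchestrator: Orchestrator
    path: PathDynamics
    repetitiveness: Repetitiveness
    cpu_tool_seconds: float       # time spent in CPU-side tool processing
    gpu_inference_seconds: float  # time spent in GPU-side LLM inference

    def cpu_latency_share(self) -> float:
        """Fraction of total latency spent in CPU tool processing (simplified model)."""
        total = self.cpu_tool_seconds + self.gpu_inference_seconds
        if total <= 0:
            raise ValueError("total latency must be positive")
        return self.cpu_tool_seconds / total
```

For a made-up profile with 9.0 s of tool time and 1.0 s of inference time, `cpu_latency_share()` returns 0.9, the kind of CPU-dominated split the study reports at its extreme.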
In practice
- Cap batch sizes to optimize median and tail latencies.
- Overlap CPU and GPU processing for mixed workloads.
- Use multi-processing for CPU-heavy tasks and multi-threading for LLM-heavy ones.
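The split described in the takeaway, multi-processing for CPU-heavy tasks and multi-threading for LLM-heavy tasks, can be sketched with Python's standard `concurrent.futures` module. This is a minimal illustration rather than the MAWS scheduler itself: the function name `pick_executor` and the labels `"cpu_heavy"` and `"llm_heavy"` are hypothetical, and the sketch assumes the caller already knows which kind of task it is submitting.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Type


def pick_executor(kind: str) -> Type[Executor]:
    """Return the executor class suited to a task kind (labels are hypothetical)."""
    if kind == "cpu_heavy":
        # Separate processes let CPU-bound tool code run on several cores at once.
        return ProcessPoolExecutor
    if kind == "llm_heavy":
        # Threads are cheap and suffice when tasks mostly wait on an inference call.
        return ThreadPoolExecutor
    raise ValueError(f"unknown task kind: {kind!r}")
```

A caller would then write something like `with pick_executor("llm_heavy")(max_workers=8) as pool: ...` and submit its tasks to `pool`. Functions sent to a process pool must be picklable, so they need to be defined at module level.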
Topics
- Agentic AI Workloads
- CPU Bottlenecks
- GPU Bottlenecks
- Performance Profiling
- Tool Processing
Code references
- crewAIInc/crewAI
- deepset-ai/haystack
- microsoft/semantic-kernel
- yoheinakajima/babyagi
- SWE-agent/mini-swe-agent
Best for: AI Architect, MLOps Engineer, Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.