ATOM: Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization
Summary
ATOM (AiTer Optimized Model) is an inference engine designed to maximize efficiency for LLM serving on AMD Instinct™ GPUs, addressing challenges like high concurrency and multi-GPU deployment. It operates as the system-level inference engine within the AMD AI stack, orchestrating execution while leveraging AITER for kernel acceleration and MoRI for distributed communication. ATOM supports standalone serving with OpenAI-compatible APIs and integrates with vLLM and SGLang ecosystems. Its architecture coordinates scheduling, KV cache management, and various parallelism strategies (TP/DP/EP). Key features include continuous batching, prefix caching, Level 3 compilation, FP8/MXFP4/INT8/INT4 quantization, and MTP speculative decoding. ATOM covers major model families like Llama, Qwen, DeepSeek, and Mixtral, optimizing for Dense, MoE, and inference-enhanced workloads. A public benchmark dashboard and official recipes aid deployment and tuning.
Key takeaway
For AI Engineers deploying LLMs on AMD Instinct GPUs, ATOM offers a unified, high-performance inference engine. You should utilize ATOM directly for its optimized execution across Dense, MoE, and MTP-enabled models, or use its architecture and recipes as a reference for tuning other frameworks. This approach can stabilize throughput and reduce per-model optimization overhead, ensuring extreme performance.
Key insights
ATOM is a ROCm-first, co-optimized inference engine for extreme LLM performance on AMD Instinct GPUs.
Principles
- System-level optimization is crucial for LLM inference.
- Deep hardware-software co-optimization yields peak efficiency.
- Unified execution frameworks reduce per-model tuning.
Method
ATOM orchestrates LLM inference by dispatching requests through an LLMEngine to EngineCore's, where a Scheduler manages batching and ModelRunner executes forward passes using optimized AITER kernels and parallelism strategies.
In practice
- Use ATOM's benchmark dashboard for nightly performance tracking.
- Employ official ATOM recipes for reproducible model deployments.
- Combine benchmark data with profiler traces to diagnose bottlenecks.
Topics
- LLM Inference
- AMD Instinct GPUs
- ATOM Inference Engine
- Software-Hardware Co-optimization
- Distributed Inference
- Quantization
Code references
Best for: Machine Learning Engineer, AI Engineer, AI Architect
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.