ATOM: Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization

2026-06-15 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, medium

Summary

ATOM (AiTer Optimized Model) is an inference engine designed to maximize efficiency for LLM serving on AMD Instinct™ GPUs, addressing challenges like high concurrency and multi-GPU deployment. It operates as the system-level inference engine within the AMD AI stack, orchestrating execution while leveraging AITER for kernel acceleration and MoRI for distributed communication. ATOM supports standalone serving with OpenAI-compatible APIs and integrates with vLLM and SGLang ecosystems. Its architecture coordinates scheduling, KV cache management, and various parallelism strategies (TP/DP/EP). Key features include continuous batching, prefix caching, Level 3 compilation, FP8/MXFP4/INT8/INT4 quantization, and MTP speculative decoding. ATOM covers major model families like Llama, Qwen, DeepSeek, and Mixtral, optimizing for Dense, MoE, and inference-enhanced workloads. A public benchmark dashboard and official recipes aid deployment and tuning.

Key takeaway

For AI Engineers deploying LLMs on AMD Instinct GPUs, ATOM offers a unified, high-performance inference engine. You should utilize ATOM directly for its optimized execution across Dense, MoE, and MTP-enabled models, or use its architecture and recipes as a reference for tuning other frameworks. This approach can stabilize throughput and reduce per-model optimization overhead, ensuring extreme performance.

Key insights

ATOM is a ROCm-first, co-optimized inference engine for extreme LLM performance on AMD Instinct GPUs.

Principles

System-level optimization is crucial for LLM inference.
Deep hardware-software co-optimization yields peak efficiency.
Unified execution frameworks reduce per-model tuning.

Method

ATOM orchestrates LLM inference by dispatching requests through an LLMEngine to EngineCore's, where a Scheduler manages batching and ModelRunner executes forward passes using optimized AITER kernels and parallelism strategies.

In practice

Use ATOM's benchmark dashboard for nightly performance tracking.
Employ official ATOM recipes for reproducible model deployments.
Combine benchmark data with profiler traces to diagnose bottlenecks.

Topics

LLM Inference
AMD Instinct GPUs
ATOM Inference Engine
Software-Hardware Co-optimization
Distributed Inference
Quantization

Code references

ROCm/ATOM

Best for: Machine Learning Engineer, AI Engineer, AI Architect

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.