AI Inference on AMD Ryzen™ AI Max Processor

2026-05-25 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, medium

Summary

AMD Ryzen™ AI Max+ processors pair AMD Radeon™ 8060S integrated graphics with up to 128 GB of unified memory. Under this Unified Memory Architecture (UMA), the CPU and GPU share a single memory pool, so models with 100 billion or more parameters can run on one system without being capped by the VRAM of a discrete GPU and without cloud costs. The article demonstrates this by running Qwen3.5 models (9B, 35B-A3B, and 122B-A10B) on an AMD Ryzen™ AI Max+ 395 system with 64 GB of GPU-accessible memory, Ubuntu 24.04 LTS, ROCm 7.2.1, and Ollama 0.20.x. In the reported benchmarks, the 35B-A3B model reaches 42.04 tok/s, while the 76 GB 122B-A10B model, which is larger than the 64 GB GPU-accessible allocation, runs at 8.59 tok/s using mixed CPU/GPU loading.

Key takeaway

For AI engineers evaluating or self-hosting large language models, the AMD Ryzen™ AI Max+ processor is a practical option for local inference. Its Unified Memory Architecture lets you run 100B+ parameter models such as Qwen3.5 122B-A10B on a single system, without multi-GPU rigs or paid cloud endpoints, at the cost of lower throughput for the largest models (8.59 tok/s in the article's benchmark). Consider this hardware for cost-effective, high-capacity LLM development and deployment, with Ollama handling model setup and management.

Key insights

AMD Ryzen™ AI Max+ processors enable local 100B+ parameter LLM inference via Unified Memory Architecture, reducing cloud dependency.

Principles

UMA allows CPU and GPU to share a single memory pool.
Mixture-of-experts (MoE) architectures activate only a subset of their parameters per token, which can increase token generation speed.
GPU-accessible memory is configurable via BIOS settings.
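
The reason the 122B-A10B model needs mixed CPU/GPU loading follows from simple arithmetic on the figures in the summary: a 76 GB model does not fit in 64 GB of GPU-accessible memory. The sketch below is a rough upper-bound estimate only. The function name split_model is hypothetical, the calculation ignores KV cache and runtime overhead, and it is not how Ollama itself decides where to place layers.

```python
def split_model(model_gb: float, gpu_gb: float) -> tuple[float, float]:
    """Return (GB that fit in GPU-accessible memory, GB left for the CPU side).

    Rough estimate: ignores KV cache, context buffers, and runtime overhead,
    so the real GPU share will be lower than this upper bound.
    """
    on_gpu = min(model_gb, gpu_gb)
    return on_gpu, model_gb - on_gpu


# Figures from the summary: 76 GB model, 64 GB GPU-accessible memory.
MODEL_GB = 76.0
GPU_GB = 64.0

on_gpu, on_cpu = split_model(MODEL_GB, GPU_GB)
print(f"GPU: {on_gpu:.0f} GB, CPU: {on_cpu:.0f} GB, GPU share: {on_gpu / MODEL_GB:.0%}")
```

With these inputs, at least 12 GB of the model has to sit outside the GPU-accessible allocation, which matches the mixed loading reported for the 122B-A10B run.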

Method

Install Ollama on Ubuntu with AMD ROCm, then pull and run the Qwen3.5 models. Verify GPU acceleration with "ollama ps", which lists each loaded model and where it is running, and monitor memory use with "rocm-smi".
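
The verification step can also be scripted. The sketch below assumes an Ollama server is listening on its default local address and uses the GET /api/ps endpoint, the HTTP counterpart of "ollama ps", whose response reports each loaded model's total size and the portion held in GPU memory (size_vram). The helper names gpu_share, running_models, and report are hypothetical, and the field names should be checked against the Ollama version in use.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def gpu_share(model: dict) -> float:
    """Fraction (0.0 to 1.0) of a loaded model's bytes held in GPU memory."""
    size = model.get("size", 0)
    return model.get("size_vram", 0) / size if size else 0.0


def running_models(base_url: str = OLLAMA_URL) -> list[dict]:
    """Fetch the list of currently loaded models from GET /api/ps."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as resp:
        return json.load(resp).get("models", [])


def report(models: list[dict]) -> list[str]:
    """Format one line per loaded model with its GPU-resident share."""
    return [f"{m.get('name', '?')}: {gpu_share(m):.0%} on GPU" for m in models]
```

With a model loaded, print("\n".join(report(running_models()))) gives a one-line summary per model. A share below 100% indicates that part of the model is being served from the CPU side.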

In practice

Use Qwen3.5 35B-A3B for fast interactive chat (~42 tok/s).
Run Qwen3.5 122B-A10B for large-capacity local workloads (~8.6 tok/s).
For large models, reduce the context length and close memory-intensive applications to free shared memory.
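
To reproduce a tok/s figure on your own system and to try a smaller context window, one option is Ollama's POST /api/generate endpoint. This is a minimal sketch, not the article's benchmark method: it assumes a local Ollama server on the default port, relies on the eval_count and eval_duration (nanoseconds) fields of the non-streaming response, and sets the context length through the num_ctx option. The function names and the default of 4096 are illustrative choices.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    duration_ns = response.get("eval_duration", 0)
    if not duration_ns:
        return 0.0
    return response.get("eval_count", 0) * 1e9 / duration_ns


def generate(model: str, prompt: str, num_ctx: int = 4096,
             base_url: str = "http://localhost:11434") -> dict:
    """Send one non-streaming request with an explicit context length."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return a single JSON object
        "options": {"num_ctx": num_ctx},  # smaller values leave more memory free
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling tokens_per_second(generate(tag, "Explain UMA in one paragraph.")) returns the measured rate, where tag is whatever name "ollama list" shows for the model you pulled. The exact Qwen3.5 tags are not given in this summary, so none is assumed here.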

Topics

LLM Inference
Unified Memory Architecture
AMD Ryzen AI Max+
Ollama
ROCm
Qwen3.5
Local AI

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.