AI Inference on AMD Ryzen™ AI Max Processor

· Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, medium

Summary

AMD Ryzen™ AI Max+ processors pair AMD Radeon™ 8060S integrated graphics with up to 128 GB of memory in a Unified Memory Architecture (UMA), in which the CPU and GPU share a single memory pool. Because model size is not capped by a discrete GPU's dedicated VRAM, models with 100 billion or more parameters can run on a single system without cloud costs. The article demonstrates this by running Qwen3.5 models (9B, 35B-A3B, and 122B-A10B) on an AMD Ryzen™ AI Max+ 395 system with 64 GB of GPU-accessible memory, Ubuntu 24.04 LTS, ROCm 7.2.1, and Ollama 0.20.x. In the reported benchmarks, the 35B-A3B model reaches 42.04 tok/s, and the 76 GB 122B-A10B model, which is larger than the 64 GB of GPU-accessible memory, runs at 8.59 tok/s using CPU/GPU mixed loading.
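The mixed-loading case follows from simple arithmetic on the figures above: a 76 GB model cannot fit entirely in 64 GB of GPU-accessible memory, so part of it has to stay on the CPU side. The sketch below is only a back-of-the-envelope illustration. The function name `split_model` and the `reserve_gb` parameter are hypothetical, and the calculation ignores KV cache and runtime overhead, so the real GPU share would likely be smaller than shown.

```python
def split_model(model_gb: float, gpu_accessible_gb: float, reserve_gb: float = 0.0):
    """Return (gpu_gb, cpu_gb): how much of the model fits in GPU-accessible
    memory and how much must remain on the CPU side.

    reserve_gb is a hypothetical allowance for KV cache and runtime overhead.
    """
    usable_gb = max(gpu_accessible_gb - reserve_gb, 0.0)
    gpu_gb = min(model_gb, usable_gb)
    cpu_gb = model_gb - gpu_gb
    return gpu_gb, cpu_gb


# Figures from the summary: a 76 GB model, 64 GB of GPU-accessible memory.
gpu_gb, cpu_gb = split_model(76.0, 64.0)
print(f"GPU: {gpu_gb:.0f} GB, CPU: {cpu_gb:.0f} GB")  # GPU: 64 GB, CPU: 12 GB
```

Under these assumptions at least 12 GB of the 122B-A10B model sits outside GPU-accessible memory, which is consistent with its lower reported throughput compared with the models that fit entirely.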

Key takeaway

For AI engineers evaluating or self-hosting large language models, the AMD Ryzen™ AI Max+ processor is a strong option for local inference. Its Unified Memory Architecture lets you run 100B+ parameter models such as Qwen3.5 122B-A10B on a single system, without multi-GPU rigs or expensive cloud endpoints. Consider this hardware for cost-effective, high-capacity LLM development and deployment, with Ollama handling setup and model management.

Key insights

AMD Ryzen™ AI Max+ processors enable local 100B+ parameter LLM inference via Unified Memory Architecture, reducing cloud dependency.

Method

Install Ollama on Ubuntu with AMD ROCm, then pull and run Qwen3.5 models. Verify GPU acceleration with "ollama ps" and monitor memory with "rocm-smi".
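The steps above can be sketched as a small Python wrapper around the command-line tools. This is a minimal sketch, not the original article's script: the model tag `qwen3.5:9b` is a hypothetical placeholder (check the Ollama model library for the exact name), and the helper names `build_commands` and `run_all` are illustrative. It assumes Ollama and ROCm are already installed, and it defaults to a dry run that only prints the commands.

```python
import shlex
import shutil
import subprocess

# Hypothetical model tag: check the Ollama model library for the exact name.
MODEL_TAG = "qwen3.5:9b"


def build_commands(model_tag: str) -> list[list[str]]:
    """Return the CLI steps from the method, in order."""
    return [
        ["ollama", "pull", model_tag],                         # download the model
        ["ollama", "run", model_tag, "Reply with one word."],  # one-shot prompt
        ["ollama", "ps"],                                      # loaded models and processor placement
        ["rocm-smi", "--showmeminfo", "vram"],                 # GPU memory usage
    ]


def run_all(commands: list[list[str]], dry_run: bool = True) -> list[str]:
    """Print each command; execute it only when dry_run is False and the tool is installed.

    Returns the list of command lines that were actually executed.
    """
    executed = []
    for cmd in commands:
        line = shlex.join(cmd)
        if dry_run or shutil.which(cmd[0]) is None:
            print(f"[skipped] {line}")
            continue
        print(f"$ {line}")
        subprocess.run(cmd, check=False)
        executed.append(line)
    return executed


if __name__ == "__main__":
    # Dry run by default; pass dry_run=False on a machine with Ollama and ROCm installed.
    run_all(build_commands(MODEL_TAG))
```

When run for real, the "ollama ps" output is the place to check whether a loaded model is placed on the GPU, the CPU, or split between them.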

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.