AI Inference on AMD Ryzen™ AI Max Processor
Summary
AMD Ryzen™ AI Max+ processors pair AMD Radeon™ 8060S integrated graphics with up to 128 GB of unified memory (Unified Memory Architecture, UMA), a significant step for local large language model (LLM) inference. Because the CPU and GPU share a single memory pool, models with 100 billion or more parameters can run on a single system without the capacity limits of dedicated VRAM or the cost of cloud endpoints. The article demonstrates this by running Qwen3.5 models (9B, 35B-A3B, and 122B-A10B) on an AMD Ryzen™ AI Max+ 395 system with 64 GB of GPU-accessible memory, Ubuntu 24.04 LTS, ROCm 7.2.1, and Ollama 0.20.x. In the benchmarks, the 35B-A3B model reaches 42.04 tok/s, and the 76 GB 122B-A10B model, which is larger than the 64 GB of GPU-accessible memory, runs at 8.59 tok/s with mixed CPU/GPU loading.
Key takeaway
For AI engineers evaluating or self-hosting large language models, AMD Ryzen™ AI Max+ processors offer a practical option for local inference. Unified memory lets 100B+ parameter models such as Qwen3.5 122B-A10B run on a single system, without multi-GPU rigs or cloud endpoints. Consider this hardware for cost-effective, high-capacity LLM development and deployment, with Ollama handling setup and model management.
Key insights
AMD Ryzen™ AI Max+ processors enable local 100B+ parameter LLM inference via Unified Memory Architecture, reducing cloud dependency.
Principles
- UMA allows CPU and GPU to share a single memory pool.
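- Mixture-of-Experts (MoE) models activate only a subset of their parameters per token (the "A3B" and "A10B" suffixes indicate roughly 3B and 10B active parameters), which can raise token generation speed relative to total model size.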
- GPU-accessible memory is configurable via BIOS settings.
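The effect of the GPU-accessible memory setting can be illustrated with simple arithmetic. The sketch below is a rough lower-bound estimate, not how Ollama actually places layers: the function name `split_model` is hypothetical, and the calculation ignores KV cache and runtime overhead, so the real CPU share will be somewhat larger.

```python
def split_model(model_gb: float, gpu_accessible_gb: float) -> dict:
    """Estimate how model weights divide between GPU-accessible and CPU memory.

    Assumption: weights fill GPU-accessible memory first and the remainder
    stays on the CPU side. KV cache and runtime overhead are ignored, so the
    CPU share returned here is a lower bound.
    """
    if model_gb <= 0 or gpu_accessible_gb < 0:
        raise ValueError("model size must be positive and GPU memory non-negative")
    gpu_gb = min(model_gb, gpu_accessible_gb)
    cpu_gb = model_gb - gpu_gb
    return {
        "gpu_gb": gpu_gb,
        "cpu_gb": cpu_gb,
        "gpu_fraction": gpu_gb / model_gb,
    }
```

With the figures from the summary, `split_model(76, 64)` puts at most 64 GB of the 122B-A10B model in GPU-accessible memory and leaves at least 12 GB on the CPU side (about 84% on the GPU), which is consistent with the mixed CPU/GPU loading reported for that model.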
Method
Install Ollama on Ubuntu with AMD ROCm, then pull and run Qwen3.5 models. Verify GPU acceleration with "ollama ps" and monitor memory with "rocm-smi".
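Once a model has been pulled, throughput can be checked roughly through Ollama's local HTTP API. The sketch below assumes the default endpoint on port 11434 and relies on the `eval_count` and `eval_duration` (nanoseconds) fields of a non-streaming `/api/generate` response. The helper names `tokens_per_second` and `measure` are illustrative, and this is not necessarily how the article's benchmark figures were produced.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Generation speed: eval_count tokens over eval_duration nanoseconds."""
    count = response["eval_count"]
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return count / (duration_ns / 1e9)


def measure(model: str, prompt: str, url: str = OLLAMA_URL) -> float:
    """Send one non-streaming request and return the measured tok/s."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))
```

Call `measure` with whichever model tag `ollama list` reports on the system and any prompt. A single request is noisy, so averaging several runs gives a steadier figure.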
In practice
- Use Qwen3.5 35B-A3B for fast interactive chat (~42 tok/s).
- Run Qwen3.5 122B-A10B for large-capacity local workloads (~8.6 tok/s).
- For large models, adjust (typically reduce) the context length and close other memory-intensive applications to free shared memory.
Topics
- LLM Inference
- Unified Memory Architecture
- AMD Ryzen AI Max+
- Ollama
- ROCm
- Qwen3.5
- Local AI
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.