A Practical Guide to Running LLMs on AMD Radeon™ GPUs

2026-06-19 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, long

Summary

A practical guide details how to run large language models (LLMs) on AMD Radeon integrated and discrete GPUs, leveraging open-source tooling for local AI inference. The guide covers setup and configuration for optimal performance using frameworks like Lemonade, LM Studio, Ollama, and llama.cpp. It explains converting PyTorch checkpoints to the GGUF format, which is supported by these tools, and provides step-by-step instructions for building llama.cpp with ROCm (recommended for best performance) or Vulkan backends on both Windows and Linux. Additionally, it outlines model quantization options (e.g., Q4_K_M, Q8_0) to reduce memory footprint and details command-line execution with "llama-cli", including key parameters like "-ngl 33" for GPU offloading and context window sizes like 4096 or 8192. Python bindings for "llama-cpp-python" with Vulkan support are also covered, demonstrating chat completion with Phi-3.5 models.

Key takeaway

For AI Engineers or ML Students aiming to deploy LLMs on AMD Radeon GPUs, this guide provides actionable steps to achieve efficient local inference. You should prioritize building "llama.cpp" with the ROCm backend for best performance and convert models to the GGUF format for broad tool compatibility. When running models, explicitly set `HIP_VISIBLE_DEVICES` for multi-GPU systems and configure context window sizes like 4096 or 8192 to manage VRAM effectively, ensuring optimal performance and avoiding memory errors.

Key insights

Running LLMs locally on AMD Radeon GPUs is now practical via open-source tools and GGUF models.

Principles

GGUF is a unified format for efficient LLM execution.
ROCm backend offers optimal performance for AMD GPUs.
Quantization reduces memory footprint with minimal quality loss.

Method

The guide outlines converting PyTorch models to GGUF, building llama.cpp with ROCm or Vulkan, quantizing models (e.g., to Q4_K_M), and running them via "llama-cli", Ollama, LM Studio, or Lemonade.

In practice

Use HIP_VISIBLE_DEVICES for multi-GPU selection.
Set "-ngl -1" or "33" to offload all layers to GPU.
Limit context window ("-c 4096") to avoid out-of-memory.

Topics

AMD Radeon GPUs
Large Language Models
GGUF Model Format
llama.cpp
ROCm Software
Model Quantization
Local AI Inference

Code references

Best for: AI Engineer, Machine Learning Engineer, AI Student

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.