A Practical Guide to Running LLMs on AMD Radeon™ GPUs
Summary
A practical guide details how to run large language models (LLMs) on AMD Radeon integrated and discrete GPUs, leveraging open-source tooling for local AI inference. The guide covers setup and configuration for optimal performance using frameworks like Lemonade, LM Studio, Ollama, and llama.cpp. It explains converting PyTorch checkpoints to the GGUF format, which is supported by these tools, and provides step-by-step instructions for building llama.cpp with ROCm (recommended for best performance) or Vulkan backends on both Windows and Linux. Additionally, it outlines model quantization options (e.g., Q4_K_M, Q8_0) to reduce memory footprint and details command-line execution with "llama-cli", including key parameters like "-ngl 33" for GPU offloading and context window sizes like 4096 or 8192. Python bindings for "llama-cpp-python" with Vulkan support are also covered, demonstrating chat completion with Phi-3.5 models.
Key takeaway
For AI Engineers or ML Students aiming to deploy LLMs on AMD Radeon GPUs, this guide provides actionable steps to achieve efficient local inference. You should prioritize building "llama.cpp" with the ROCm backend for best performance and convert models to the GGUF format for broad tool compatibility. When running models, explicitly set `HIP_VISIBLE_DEVICES` for multi-GPU systems and configure context window sizes like 4096 or 8192 to manage VRAM effectively, ensuring optimal performance and avoiding memory errors.
Key insights
Running LLMs locally on AMD Radeon GPUs is now practical via open-source tools and GGUF models.
Principles
- GGUF is a unified format for efficient LLM execution.
- ROCm backend offers optimal performance for AMD GPUs.
- Quantization reduces memory footprint with minimal quality loss.
Method
The guide outlines converting PyTorch models to GGUF, building llama.cpp with ROCm or Vulkan, quantizing models (e.g., to Q4_K_M), and running them via "llama-cli", Ollama, LM Studio, or Lemonade.
In practice
- Use HIP_VISIBLE_DEVICES for multi-GPU selection.
- Set "-ngl -1" or "33" to offload all layers to GPU.
- Limit context window ("-c 4096") to avoid out-of-memory.
Topics
- AMD Radeon GPUs
- Large Language Models
- GGUF Model Format
- llama.cpp
- ROCm Software
- Model Quantization
- Local AI Inference
Code references
- ggerganov/llama.cpp
- ggerganov/llama.cpp
- microsoft/Phi-3.5-mini-instruct
- ggml-org/llama.cpp
- abetlen/llama-cpp-python
Best for: AI Engineer, Machine Learning Engineer, AI Student
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.