Improved performance and model support with GGUF

2026-06-04 · Source: Ollama Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Intermediate, quick

Summary

Ollama 0.30, released on June 5, 2026, introduces significant enhancements including improved performance and expanded GGUF model compatibility, achieved through integration with llama.cpp. This update complements Ollama's existing MLX engine on Apple silicon, broadening support for various models across a wider array of hardware. Specifically, NVIDIA hardware now experiences up to 20% faster throughput, benefiting from optimizations by NVIDIA and llama.cpp teams, as demonstrated with the Gemma 4 26B model on an NVIDIA RTX 5090 using Q4_K_M quantization. Furthermore, Vulkan is now enabled by default, extending GPU acceleration to AMD and Intel devices without requiring vendor-specific libraries. The release also enhances model compatibility, allowing more GGUF models, including LFM, Prism, and Unsloth fine-tuned models, to run out of the box, with tool calling capabilities preserved for use with coding agents.

Key takeaway

For MLOps Engineers deploying local LLMs on diverse hardware, Ollama 0.30 significantly simplifies model integration and boosts performance. You can now utilize GGUF models from Hugging Face directly, including those with tool calling, across NVIDIA, AMD, and Intel GPUs without complex library setups. This update means faster inference, up to 20% on NVIDIA, and broader hardware compatibility, streamlining your local development and deployment workflows. Consider updating to capitalize on these performance and compatibility gains immediately.

Key insights

Ollama 0.30 significantly boosts performance and broadens model/hardware compatibility via GGUF and llama.cpp integration.

Principles

GGUF compatibility expands model ecosystem access.
Vulkan enables broad GPU acceleration without vendor libraries.
Tool calling capabilities persist with GGUF models.

Method

To run a GGUF model, download the file, create a Modelfile pointing to its path, then use "ollama create" and "ollama run" commands.

In practice

Run Gemma 4 26B on NVIDIA RTX 5090 for 20% faster throughput.
Utilize GGUF models with coding agents like Claude Code or Hermes Agent.

Topics

Ollama 0.30
GGUF Models
llama.cpp
GPU Acceleration
Vulkan API
Tool Calling
Coding Agents

Code references

ggml-org/llama.cpp

Best for: NLP Engineer, AI Engineer, Machine Learning Engineer, MLOps Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Ollama Blog.