Improved performance and model support with GGUF
Summary
Ollama 0.30, released on June 5, 2026, introduces significant enhancements including improved performance and expanded GGUF model compatibility, achieved through integration with llama.cpp. This update complements Ollama's existing MLX engine on Apple silicon, broadening support for various models across a wider array of hardware. Specifically, NVIDIA hardware now experiences up to 20% faster throughput, benefiting from optimizations by NVIDIA and llama.cpp teams, as demonstrated with the Gemma 4 26B model on an NVIDIA RTX 5090 using Q4_K_M quantization. Furthermore, Vulkan is now enabled by default, extending GPU acceleration to AMD and Intel devices without requiring vendor-specific libraries. The release also enhances model compatibility, allowing more GGUF models, including LFM, Prism, and Unsloth fine-tuned models, to run out of the box, with tool calling capabilities preserved for use with coding agents.
Key takeaway
For MLOps Engineers deploying local LLMs on diverse hardware, Ollama 0.30 significantly simplifies model integration and boosts performance. You can now utilize GGUF models from Hugging Face directly, including those with tool calling, across NVIDIA, AMD, and Intel GPUs without complex library setups. This update means faster inference, up to 20% on NVIDIA, and broader hardware compatibility, streamlining your local development and deployment workflows. Consider updating to capitalize on these performance and compatibility gains immediately.
Key insights
Ollama 0.30 significantly boosts performance and broadens model/hardware compatibility via GGUF and llama.cpp integration.
Principles
- GGUF compatibility expands model ecosystem access.
- Vulkan enables broad GPU acceleration without vendor libraries.
- Tool calling capabilities persist with GGUF models.
Method
To run a GGUF model, download the file, create a Modelfile pointing to its path, then use "ollama create" and "ollama run" commands.
In practice
- Run Gemma 4 26B on NVIDIA RTX 5090 for 20% faster throughput.
- Utilize GGUF models with coding agents like Claude Code or Hermes Agent.
Topics
- Ollama 0.30
- GGUF Models
- llama.cpp
- GPU Acceleration
- Vulkan API
- Tool Calling
- Coding Agents
Code references
Best for: NLP Engineer, AI Engineer, Machine Learning Engineer, MLOps Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Ollama Blog.