New model scheduling

2025-09-22 · Source: Ollama Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Intermediate, quick

Summary

Ollama released a new model scheduling system on September 23, 2025. Instead of estimating a model's memory requirements, it measures them exactly before the model runs. This reduces out-of-memory crashes, maximizes GPU utilization, and improves multi-GPU performance, especially with mismatched GPUs. Memory utilization reporting is also accurate now, so tools like `nvidia-smi` match the output of `ollama ps`. For example, `gemma3:12b` with a 128k context on a single NVIDIA GeForce RTX 4090 saw token generation speed rise from 52.02 to 85.54 tokens/s. Similarly, `mistral-small3.2` with image input on two RTX 4090s saw prompt evaluation speed rise from 127.84 to 1380.24 tokens/s. The feature is enabled by default for models such as `gpt-oss`, `llama4`, `gemma3`, `qwen3`, and `mistral-small3.2`, with more models to follow as they transition to Ollama's new engine.
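To put the two benchmark figures in proportion, the short sketch below computes the speedup ratios from the numbers quoted in the summary. The `speedup` helper and the dictionary labels are illustrative names, not part of Ollama.

```python
# Throughput figures quoted in the summary (tokens per second): (before, after).
benchmarks = {
    "gemma3:12b, 128k context, 1x RTX 4090 (token generation)": (52.02, 85.54),
    "mistral-small3.2, image input, 2x RTX 4090 (prompt evaluation)": (127.84, 1380.24),
}


def speedup(before: float, after: float) -> float:
    """Return the ratio of new throughput to old throughput."""
    return after / before


for name, (before, after) in benchmarks.items():
    print(f"{name}: {speedup(before, after):.2f}x")
```

This works out to roughly 1.64x for token generation in the first case and roughly 10.80x for prompt evaluation in the second.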

Key takeaway

For NLP engineers and MLOps teams deploying large language models, Ollama's updated scheduling system should make models run more reliably and efficiently. Update to the latest Ollama version to benefit from fewer out-of-memory errors and faster token generation, especially with long contexts or multi-GPU setups. Higher throughput on the same hardware can in turn lower operating costs.

Key insights

Ollama's new model scheduler precisely measures memory, boosting GPU utilization and multi-GPU performance.

Principles

Exact memory measurement reduces OOM errors.
Optimized scheduling improves multi-GPU efficiency.

Method

The new engine measures the exact memory required for a model before execution, rather than relying on estimations, to optimize resource allocation and improve performance.
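The effect of exact measurement on placement can be illustrated with a toy example. This is not Ollama's scheduler: `place_layers`, the `CPU` marker, and all sizes are hypothetical, and the greedy first-fit strategy is only a stand-in chosen for simplicity. The sketch shows how the same placement logic keeps more layers on mismatched GPUs when it works from measured sizes rather than padded estimates.

```python
from typing import List

CPU = -1  # hypothetical marker for layers left in system memory


def place_layers(layer_mib: List[int], gpu_free_mib: List[int]) -> List[int]:
    """Greedily assign each layer to the first GPU that still has room for it.

    Returns one device index per layer; layers that fit on no GPU get CPU.
    All sizes are in MiB.
    """
    remaining = list(gpu_free_mib)  # copy so the caller's list is untouched
    placement = []
    for size in layer_mib:
        for gpu, free in enumerate(remaining):
            if size <= free:
                remaining[gpu] -= size
                placement.append(gpu)
                break
        else:
            # No GPU had enough free memory for this layer.
            placement.append(CPU)
    return placement


# Two mismatched GPUs with 1000 MiB and 1500 MiB free (made-up numbers).
gpus = [1000, 1500]
measured = [400] * 6   # hypothetical exact per-layer sizes
estimated = [550] * 6  # hypothetical conservative estimates

print(place_layers(measured, gpus))
print(place_layers(estimated, gpus))
```

With the measured sizes, five of the six layers land on a GPU; with the padded estimates, only three do, and the rest fall back to the CPU.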

In practice

Use `ollama ps` for accurate memory tracking.
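Expect faster token generation on long-context workloads, as benchmarked with `gemma3:12b` at a 128k context.
Expect faster prompt evaluation for image input on multi-GPU setups, as benchmarked with `mistral-small3.2`.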
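Memory tracking can also be scripted. The sketch below is a minimal example that assumes a local Ollama server on the default address `http://localhost:11434` and the `/api/ps` endpoint of Ollama's REST API, whose response lists running models with `name`, `size`, and `size_vram` fields. None of this appears in the original article, so check it against the API documentation for your version. The helper names are illustrative.

```python
import json
import urllib.request


def summarize_running_models(ps_payload: dict) -> list:
    """Convert an /api/ps style payload into (name, total, vram, gpu_fraction) rows.

    Sizes are in bytes; gpu_fraction is the share of the model held in GPU memory.
    """
    rows = []
    for model in ps_payload.get("models", []):
        total = model.get("size", 0)
        vram = model.get("size_vram", 0)
        fraction = vram / total if total else 0.0
        rows.append((model.get("name", "?"), total, vram, fraction))
    return rows


def fetch_running_models(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of loaded models (assumes a running Ollama server)."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as response:
        return json.load(response)


# Example call, which needs a running server:
#   for row in summarize_running_models(fetch_running_models()):
#       print(row)
```

The per-model `size_vram` values can then be compared with what `nvidia-smi` reports for the same GPUs.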

Topics

Model Scheduling
GPU Optimization
Memory Management
Multi-GPU Performance
LLM Performance

Best for: NLP Engineer, Computer Vision Engineer, MLOps Engineer, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Ollama Blog.