New model scheduling
Summary
On September 23, 2025, Ollama released a new model scheduling system that measures the exact memory a model requires before running it, replacing the previous estimate-based approach. The change reduces out-of-memory crashes, maximizes GPU utilization, and improves multi-GPU performance, especially with mismatched GPUs. Memory reporting is also accurate now: figures from tools such as `nvidia-smi` match `ollama ps`. For example, `gemma3:12b` with a 128k context on an NVIDIA GeForce RTX 4090 went from 52.02 to 85.54 tokens/s in token generation. Similarly, `mistral-small3.2` with image input on two RTX 4090s went from 127.84 to 1380.24 tokens/s in prompt evaluation. The new scheduling is enabled by default for models such as `gpt-oss`, `llama4`, `gemma3`, `qwen3`, and `mistral-small3.2`, with more models to follow.
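To put the quoted figures in perspective, the short Python sketch below computes the speedup implied by each before/after pair. The throughput numbers are the ones quoted in the summary; the `speedup` helper and the dictionary labels are illustrative names, not part of Ollama.

```python
# Throughput figures quoted in the summary, in tokens/s: (before, after).
benchmarks = {
    "gemma3:12b token generation (RTX 4090, 128k context)": (52.02, 85.54),
    "mistral-small3.2 prompt evaluation (2x RTX 4090, image input)": (127.84, 1380.24),
}


def speedup(before: float, after: float) -> float:
    """Return how many times faster `after` is compared with `before`."""
    return after / before


for name, (before, after) in benchmarks.items():
    print(f"{name}: {speedup(before, after):.2f}x")
```

Token generation on the single-GPU case improves by roughly 1.6x, while prompt evaluation on the two-GPU case improves by roughly 10.8x.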
Key takeaway
For NLP engineers and MLOps teams deploying large language models, Ollama's updated scheduling system means models run more reliably and efficiently. Update to the latest Ollama version to benefit from fewer out-of-memory errors and faster token generation, especially with long contexts or multi-GPU setups. Fewer crashes and higher throughput on the same hardware bear directly on operational cost.
Key insights
Ollama's new model scheduler precisely measures memory, boosting GPU utilization and multi-GPU performance.
Principles
- Exact memory allocation prevents OOM errors.
- Optimized scheduling improves multi-GPU efficiency.
Method
The new engine measures the exact memory required for a model before execution, rather than relying on estimations, to optimize resource allocation and improve performance.
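Why measurement matters can be illustrated with a toy calculation. The sketch below is not Ollama's scheduler: the layer sizes, the 24 GiB card, the overhead figures, and the `layers_that_fit` helper are all hypothetical, chosen only to show how a pessimistic estimate leaves GPU memory unused while a measured value lets more of the model sit on the GPU.

```python
GIB = 1024**3


def layers_that_fit(layer_bytes: list[int], free_bytes: int, overhead_bytes: int) -> int:
    """Count how many leading layers fit on one GPU after reserving overhead."""
    budget = free_bytes - overhead_bytes
    used = 0
    count = 0
    for size in layer_bytes:
        if used + size > budget:
            break  # the next layer would exceed the remaining budget
        used += size
        count += 1
    return count


# Hypothetical model: 48 layers of 0.5 GiB each on a hypothetical 24 GiB GPU.
layers = [GIB // 2] * 48
free = 24 * GIB

# Hypothetical overheads: a conservative estimate (6 GiB) versus a measured value (2 GiB).
with_estimate = layers_that_fit(layers, free, overhead_bytes=6 * GIB)
with_measurement = layers_that_fit(layers, free, overhead_bytes=2 * GIB)

print(f"layers on GPU with estimate:    {with_estimate}")
print(f"layers on GPU with measurement: {with_measurement}")
```

The opposite error is worse: an estimate that is too low would place more layers than actually fit, which is the kind of out-of-memory crash that exact measurement is meant to avoid.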
In practice
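- Check memory usage with `ollama ps`; its figures now match `nvidia-smi`.
- Expect the largest gains on long-context workloads; the cited benchmark ran `gemma3:12b` at 128k context.
- Expect gains on multi-GPU setups with image input; the cited benchmark ran `mistral-small3.2` on two RTX 4090s.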
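For automated monitoring, the same information that `ollama ps` prints is also available over Ollama's local HTTP API. The following is a minimal sketch, assuming a server at the default address `http://localhost:11434` and the `size` and `size_vram` fields of the `/api/ps` response; `summarize_ps` and `fetch_ps` are illustrative helper names, not part of Ollama.

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"  # assumed default local address


def summarize_ps(payload: dict) -> list[dict]:
    """Summarize an /api/ps response: total size, VRAM size, and GPU share per model."""
    rows = []
    for model in payload.get("models", []):
        size = model.get("size", 0)
        size_vram = model.get("size_vram", 0)
        rows.append({
            "name": model.get("name", "?"),
            "size_gib": round(size / 1024**3, 2),
            "vram_gib": round(size_vram / 1024**3, 2),
            # Share of the loaded model that resides in GPU memory.
            "gpu_percent": round(100 * size_vram / size) if size else 0,
        })
    return rows


def fetch_ps(base_url: str = OLLAMA_URL) -> dict:
    """Fetch the list of currently loaded models from a running Ollama server."""
    with urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        for row in summarize_ps(fetch_ps()):
            print(row)
    except OSError as exc:  # covers connection errors and timeouts
        print(f"Could not reach Ollama at {OLLAMA_URL}: {exc}")
```

The `vram_gib` values can then be compared against the per-process memory that `nvidia-smi` reports for the Ollama process.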
Topics
- Model Scheduling
- GPU Optimization
- Memory Management
- Multi-GPU Performance
- LLM Performance
Best for: NLP Engineer, Computer Vision Engineer, MLOps Engineer, AI Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Ollama Blog.