OpenAI gpt-oss

2025-08-04 · Source: Ollama Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, quick

Summary

OpenAI has released two new open-weight models, gpt-oss 20B and gpt-oss 120B, in partnership with Ollama, making them available for local deployment as of August 5, 2025. These models are designed for powerful reasoning, agentic tasks, and versatile developer use cases, featuring native function calling, web browsing, Python tool calls, and structured outputs. They support full chain-of-thought access and configurable reasoning effort (low, medium, high). The models are fine-tunable and released under the permissive Apache 2.0 license. To reduce memory footprint, OpenAI utilizes MXFP4 quantization for the mixture-of-experts (MoE) weights, which constitute over 90% of parameters, enabling the 20B model to run on 16GB memory and the 120B model on a single 80GB GPU. Ollama natively supports this MXFP4 format, with new kernels developed for its engine, and has collaborated with NVIDIA to accelerate gpt-oss performance on GeForce RTX and RTX PRO GPUs.

Key takeaway

For AI/ML Directors evaluating new local deployment options, OpenAI's gpt-oss models offer a compelling solution due to their agentic capabilities, Apache 2.0 license, and efficient MXFP4 quantization. You should consider integrating these models, especially the 20B version for specialized tasks or the 120B for general-purpose production, to leverage powerful reasoning on existing NVIDIA RTX hardware. This partnership with Ollama and NVIDIA simplifies deployment and ensures performance.

Key insights

OpenAI's gpt-oss models offer powerful local AI with agentic features and efficient quantization under an Apache 2.0 license.

Principles

Quantization significantly reduces memory footprint.
Open-source models foster broad utility and customization.

Method

The gpt-oss models use MXFP4 quantization for mixture-of-experts weights, reducing memory to enable local execution on GPUs with 16GB or 80GB memory, supported natively by Ollama's new engine.

In practice

Use `ollama run gpt-oss:20b` for lower latency tasks.
Employ `ollama run gpt-oss:120b` for high reasoning production use.

Topics

OpenAI gpt-oss
Large Language Models
Model Quantization
Agentic AI
Ollama Platform

Best for: CTO, VP of Engineering/Data, Director of AI/ML, Machine Learning Engineer, AI Engineer, Software Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Ollama Blog.