Running Granite 4 Language Models with Ollama

· Source: Niklas Heidloff · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, long

Summary

IBM has released the Granite 4 family of open-source language models, which significantly reduce memory requirements for AI inference by up to 70% compared to traditional Transformer-based models. This efficiency is achieved through a hybrid architecture combining Transformer and Mamba layers, allowing for linear scaling of context windows rather than quadratic. The Granite 4 models come in various sizes, including 32B (9B activated), 7B (1B activated), 3B, 1.5B, and 350M parameters, catering to diverse applications from enterprise RAG to edge computing. These models support features like tool calling, RAG, FIM, and structured output, and can be run locally via Ollama, with quantized Q4 versions suitable for most modern notebooks. The smallest models can even operate in browsers using Transformers.js, WebGPU, and ONNX.

Key takeaway

For AI Engineers and Machine Learning Engineers focused on optimizing inference costs and deploying LLMs on resource-constrained hardware, the IBM Granite 4 models offer a compelling solution. Their hybrid Transformer/Mamba architecture provides up to a 70% reduction in memory usage, enabling local execution via Ollama or even in-browser deployment. You should explore these models for applications requiring large context windows, low latency, or edge deployment to significantly lower operational costs and expand accessibility.

Key insights

IBM's Granite 4 models use a hybrid Transformer/Mamba architecture to drastically cut AI inference memory needs.

Principles

Method

Run Granite 4 models locally using Ollama, specifying quantization (e.g., Q4_K_M or Q8). For browser-based use, leverage Transformers.js, WebGPU, and ONNX.

In practice

Topics

Best for: AI Engineer, Machine Learning Engineer, Software Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Niklas Heidloff.