Local Agentic Programming on the Cheap: Claude Code + Ollama + Gemma4
Summary
This article details building a local, cost-effective agentic programming stack using Ollama, Google DeepMind's Gemma 4, and Claude Code. It focuses on Gemma 4 26B MoE, released April 2, 2026, under Apache 2.0, which activates 3.8 billion parameters and achieves 77.1% on LiveCodeBench v6 and 86.4% on τ2-bench for agentic tool use. The setup requires ~16–18 GB VRAM for a 256K context window. The guide covers installing Ollama and Claude Code, then configuring a Modelfile to override Ollama's default 4K context to 65536 tokens, setting temperature to 0.2, and adding a specific system prompt for agentic coding. It also explains wiring Claude Code to the local Ollama endpoint via "settings.json" and provides a Python script to verify the setup's health and tool-calling functionality. Common issues like tool parameter errors, context window swapping, and model unloading are addressed with specific fixes.
Key takeaway
For AI Engineers seeking to reduce cloud API costs and enhance privacy for agentic coding, implementing a local stack with Ollama, Gemma 4, and Claude Code is highly effective. You should configure a custom Modelfile to ensure adequate context window and low temperature, then verify tool-calling functionality with the provided script. This setup enables private, zero-cost execution of tasks like code analysis and test generation, freeing up cloud resources for more complex architectural challenges.
Key insights
Local agentic coding with Gemma 4 and Claude Code offers a private, cost-free alternative to cloud LLMs for daily engineering tasks.
Principles
- Open-weight LLMs like Gemma 4 can match cloud models for agentic coding.
- Modelfiles are crucial for optimizing local LLM context and behavior.
- Low temperature improves tool call reliability in agentic loops.
Method
Install Ollama and Claude Code. Create a Modelfile for Gemma 4 to set context (65536 tokens), temperature (0.2), and system prompt. Configure Claude Code's "settings.json" to point to Ollama's local endpoint. Verify setup with a Python script.
In practice
- Use "num_ctx 65536" in Modelfile to prevent context window failures.
- Set "temperature 0.2" to ensure reliable tool call formatting.
- Export "OLLAMA_KEEP_ALIVE=-1" to prevent model unloading delays.
Topics
- Local LLMs
- Agentic Programming
- Gemma 4
- Ollama
- Claude Code
- Modelfile
- Tool Calling
Code references
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by KDnuggets.