Karpathy's LLM Wiki with Local Gemma 4 & llama.cpp | Agentic Tool Calling Test

2026-04-10 · Source: Venelin Valkov · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, extended

Summary

This content explores an agentic workflow for Gemma 4, building upon Andrej Karpathy's "LLM Wiki" concept. The core idea is to create a persistent, compounding knowledge base where an LLM manages updates, summarization, and cross-referencing, moving beyond stateless RAG systems. The author demonstrates a local implementation using a 26B parameter Gemma 4 model, quantized to 5-bit (Q5_K_M), running on a llama.cpp server on an Apple M4 with 48GB unified memory, achieving 40-45 tokens/second. The system, called "Habit Wiki," tracks daily journals, habits, and goals, with the LLM automating bookkeeping and maintenance. The demonstration highlights Gemma 4's capabilities in tool calling (read/write files), understanding code, and evaluating project architecture, while also noting limitations in complex planning compared to larger models like Opus 4.6.

Key takeaway

For AI Engineers exploring local LLM applications, consider implementing Karpathy's LLM Wiki concept to build persistent, evolving knowledge systems. Focus on designing robust index structures and tool-calling mechanisms, as demonstrated with Gemma 4 and llama.cpp, to automate knowledge management and overcome the limitations of stateless RAG, even with smaller, quantized models.

Key insights

Karpathy's LLM Wiki concept offers a vectorless, agentic approach to knowledge management, with LLMs maintaining an evolving, indexed knowledge base.

Principles

LLMs can automate wiki maintenance.
Index files enable selective context loading.
Compounding knowledge improves over stateless RAG.
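
To make "selective context loading" concrete, here is a minimal sketch under stated assumptions. The index line format ("- [Title](path): summary") and the function names are hypothetical and do not come from the original demonstration. The keyword-overlap ranking is only a stand-in for the step where the LLM itself reads the index and decides which pages to open.

```python
import re
from pathlib import Path

# Hypothetical index line format: "- [Title](relative/path.md): one-line summary"
INDEX_LINE = re.compile(
    r"^- \[(?P<title>[^\]]+)\]\((?P<path>[^)]+)\):\s*(?P<summary>.*)$"
)
WORD = re.compile(r"[a-z0-9]+")


def parse_index(index_text: str) -> list[dict]:
    """Turn index lines into entries; lines that do not match are skipped."""
    entries = []
    for line in index_text.splitlines():
        match = INDEX_LINE.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries


def select_pages(entries: list[dict], question: str, limit: int = 2) -> list[dict]:
    """Rank entries by word overlap with the question.

    Stand-in for the LLM's own choice of which pages to read.
    """
    question_words = set(WORD.findall(question.lower()))
    scored = []
    for entry in entries:
        entry_words = set(WORD.findall(f"{entry['title']} {entry['summary']}".lower()))
        score = len(question_words & entry_words)
        if score > 0:
            scored.append((score, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:limit]]


def load_context(wiki_root: Path, entries: list[dict]) -> str:
    """Read only the selected pages from disk, not the whole wiki."""
    return "\n\n".join(
        (wiki_root / entry["path"]).read_text(encoding="utf-8") for entry in entries
    )
```

The point of the design is that the index stays small enough to fit in every prompt, while full pages enter the context only when they are selected.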

Method

An LLM acts as a librarian: it ingests new sources, categorizes information, and updates a markdown-based wiki through an index file. Its core operations are ingestion, querying, and linting (health checks).
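
The summary does not spell out what a health check covers. As a rough sketch, assume it includes at least two mechanical checks: links that point to missing pages, and pages that the index never mentions. The file name index.md, the markdown link style, and the lint_wiki function are illustrative assumptions. In the workflow described here the LLM performs this maintenance itself, so a helper like this would at most feed it a report.

```python
import re
from pathlib import Path

# Matches markdown links to local .md files, e.g. [Sleep](habits/sleep.md)
LINK = re.compile(r"\[[^\]]*\]\(([^)#]+\.md)\)")


def lint_wiki(wiki_root: Path, index_name: str = "index.md") -> dict:
    """Report broken links and pages missing from the index (hypothetical checks)."""
    pages = {p.relative_to(wiki_root).as_posix() for p in wiki_root.rglob("*.md")}
    index_text = (wiki_root / index_name).read_text(encoding="utf-8")
    indexed = set(LINK.findall(index_text))

    broken = []
    for page in sorted(pages):
        page_path = wiki_root / page
        text = page_path.read_text(encoding="utf-8")
        for target in LINK.findall(text):
            if "://" in target:
                continue  # external URL, not a wiki page
            # Links are resolved relative to the page that contains them.
            if not (page_path.parent / target).resolve().is_file():
                broken.append((page, target))

    orphans = sorted(pages - indexed - {index_name})
    return {"broken_links": broken, "orphans": orphans}
```

A report like this could be passed to the model as part of a linting prompt, leaving the actual repair (rewriting links, adding index entries) to the LLM.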

In practice

Run Gemma 4 (26B, 5-bit quant) on 24GB+ VRAM.
Use llama.cpp for local LLM serving.
Implement a "Habit Wiki" for personal tracking.
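
A minimal sketch of the read/write tool-calling loop, assuming the llama.cpp server is running locally and exposing its OpenAI-compatible chat completions endpoint. The URL, the port, the tool names read_file and write_file, and the step limit are assumptions for illustration, not the author's code. Depending on the llama.cpp version, tool calling may require starting the server with the --jinja flag.

```python
import json
import urllib.request
from pathlib import Path

# Assumption: local llama.cpp server with an OpenAI-compatible endpoint.
SERVER_URL = "http://localhost:8080/v1/chat/completions"

# Hypothetical tool definitions in the OpenAI "tools" schema.
TOOLS = [
    {"type": "function", "function": {
        "name": "read_file",
        "description": "Read a wiki page by relative path.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}}},
    {"type": "function", "function": {
        "name": "write_file",
        "description": "Create or overwrite a wiki page.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"},
                                      "content": {"type": "string"}},
                       "required": ["path", "content"]}}},
]


def safe_path(root: Path, relative: str) -> Path:
    """Resolve a path and refuse anything outside the wiki root."""
    target = (root / relative).resolve()
    if not target.is_relative_to(root.resolve()):
        raise ValueError(f"path escapes wiki root: {relative}")
    return target


def run_tool(root: Path, name: str, arguments_json: str) -> str:
    """Execute one tool call; arguments arrive as a JSON string."""
    args = json.loads(arguments_json)
    if name == "read_file":
        return safe_path(root, args["path"]).read_text(encoding="utf-8")
    if name == "write_file":
        target = safe_path(root, args["path"])
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(args["content"], encoding="utf-8")
        return f"wrote {args['path']}"
    return f"unknown tool: {name}"


def chat(messages: list[dict]) -> dict:
    """Send the conversation plus tool schemas; return the assistant message."""
    payload = json.dumps({"messages": messages, "tools": TOOLS}).encode("utf-8")
    request = urllib.request.Request(
        SERVER_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["message"]


def agent_turn(root: Path, messages: list[dict], max_steps: int = 8) -> str:
    """Loop until the model answers without requesting another tool call."""
    for _ in range(max_steps):
        message = chat(messages)
        messages.append(message)
        calls = message.get("tool_calls") or []
        if not calls:
            return message.get("content") or ""
        for call in calls:
            result = run_tool(root, call["function"]["name"],
                              call["function"]["arguments"])
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": result})
    return "stopped: step limit reached"
```

Restricting both tools to the wiki root keeps a confused or overly eager model from reading or overwriting files elsewhere on disk, which matters more with smaller quantized models that plan less reliably.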

Topics

Karpathy's LLM Wiki
Gemma 4
Local LLMs
Agentic Tool Calling
llama.cpp

Best for: Machine Learning Engineer, AI Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Venelin Valkov.