Hermes Agent with Qwen3.6 (Local) | AI App Template with LangChain, LangGraph, llama.cpp | 🔴 Live

2026-04-18 · Source: Venelin Valkov · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Intermediate, extended

Summary

This content details an attempt to use the Hermes agent with a locally run Qwen 3.6 35B 4-bit quantized model on a Mac M4 Pro with 48 GB unified memory, utilizing llama.cpp for local inference. The primary goal was to develop a starter template for AI applications, including a backend (FastAPI, Python, LangChain, LangGraph, DocLink, Postgres with PG Vector) and a frontend (Next.js with TypeScript) for a RAG implementation. The setup involved configuring llama.cpp server with a temperature of 0.6 for coding tasks, consuming 27-28 GB of memory, and then setting up Hermes agent version 0.10 in a Docker container, pointing it to the local llama.cpp server. Customizations included adding a "Karpathy guidelines" skill to the agent's system prompt to encourage simpler, goal-driven implementations. Initial interactions revealed challenges, including the model hallucinating Next.js versions and struggling with complex, multi-step coding tasks, often getting stuck or producing suboptimal code, suggesting limitations for agentic coding with this specific local LLM setup.

Key takeaway

For AI Engineers building agentic applications with local LLMs, carefully consider the trade-offs of model quantization and hardware. While 4-bit Qwen 3.6 on an M4 Pro can run, its performance in complex coding tasks with agents like Hermes is suboptimal. Prioritize higher quantization (e.g., 8-bit) if your hardware allows, and be prepared for iterative refinement rather than one-shot solutions, potentially using more powerful frontier models for benchmarking complex agentic workflows.

Key insights

Local LLMs like Qwen 3.6 face significant challenges in agentic coding tasks, especially with lower quantization.

Principles

Lower quantization (e.g., 4-bit) can degrade agentic coding performance.
Persistent storage is crucial for iterative agent development in Docker.
Agentic LLMs benefit from explicit planning and constrained execution.

Method

Run a quantized LLM (Qwen 3.6 4-bit) locally via llama.cpp server, integrate with Hermes agent in Docker, and customize with specific coding guidelines (Karpathy) for AI application development.

In practice

Use 8-bit quantization for Qwen 3.6 on M4 Max with 128GB memory.
Employ `llama.cpp` or `vLLM` for optimal local LLM inference.
Implement persistent Docker volumes for agent workspaces.

Topics

Hermes Agent
Qwen 3.6
llama.cpp
Local LLM Inference
AI Application Development

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Venelin Valkov.