Hermes Agent with Qwen3.6 (Local) | AI App Template with LangChain, LangGraph, llama.cpp | π΄ Live
Summary
This content details an attempt to use the Hermes agent with a locally run Qwen 3.6 35B 4-bit quantized model on a Mac M4 Pro with 48 GB unified memory, utilizing llama.cpp for local inference. The primary goal was to develop a starter template for AI applications, including a backend (FastAPI, Python, LangChain, LangGraph, DocLink, Postgres with PG Vector) and a frontend (Next.js with TypeScript) for a RAG implementation. The setup involved configuring llama.cpp server with a temperature of 0.6 for coding tasks, consuming 27-28 GB of memory, and then setting up Hermes agent version 0.10 in a Docker container, pointing it to the local llama.cpp server. Customizations included adding a "Karpathy guidelines" skill to the agent's system prompt to encourage simpler, goal-driven implementations. Initial interactions revealed challenges, including the model hallucinating Next.js versions and struggling with complex, multi-step coding tasks, often getting stuck or producing suboptimal code, suggesting limitations for agentic coding with this specific local LLM setup.
Key takeaway
For AI Engineers building agentic applications with local LLMs, carefully consider the trade-offs of model quantization and hardware. While 4-bit Qwen 3.6 on an M4 Pro can run, its performance in complex coding tasks with agents like Hermes is suboptimal. Prioritize higher quantization (e.g., 8-bit) if your hardware allows, and be prepared for iterative refinement rather than one-shot solutions, potentially using more powerful frontier models for benchmarking complex agentic workflows.
Key insights
Local LLMs like Qwen 3.6 face significant challenges in agentic coding tasks, especially with lower quantization.
Principles
- Lower quantization (e.g., 4-bit) can degrade agentic coding performance.
- Persistent storage is crucial for iterative agent development in Docker.
- Agentic LLMs benefit from explicit planning and constrained execution.
Method
Run a quantized LLM (Qwen 3.6 4-bit) locally via llama.cpp server, integrate with Hermes agent in Docker, and customize with specific coding guidelines (Karpathy) for AI application development.
In practice
- Use 8-bit quantization for Qwen 3.6 on M4 Max with 128GB memory.
- Employ `llama.cpp` or `vLLM` for optimal local LLM inference.
- Implement persistent Docker volumes for agent workspaces.
Topics
- Hermes Agent
- Qwen 3.6
- llama.cpp
- Local LLM Inference
- AI Application Development
Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Venelin Valkov.