How Much Does It Actually Cost to Run a Local LLM? (€ per Million Tokens, Measured)
Summary
The article measures the energy cost of running local LLMs on a single RTX 3090 (24 GB) machine named "ardi", using `ollama` for inference and `HomeLab Monitor` for power tracking. It compares three `Q4_K_M`-quantized GGUF Gemma models: `gemma3:1b` (1B params), `gemma4:26b` (25.8B params), and `gemma3:27b` (27B params). Cost is expressed in euros per million output tokens, computed from GPU energy consumption sampled via `nvidia-smi` and a two-rate electricity tariff (€0.30 day, €0.18 night). Each model ran a 4-minute loop of 256-token generations. The measured costs were €0.118/M tokens for `gemma3:1b`, €0.272/M tokens for `gemma4:26b`, and €0.706/M tokens for `gemma3:27b`. The largest model therefore cost more in electricity than cloud Flash-class APIs (~€0.55/M tokens), while the two smaller models cost well under that. The article concludes that cost per token scales worse than linearly with model size and that architecture strongly affects efficiency.
Key takeaway
For AI engineers evaluating local LLM deployments: pick the smallest model that meets your quality requirements. The assumption that local inference is free, or always cheaper than cloud APIs, often fails for larger, less efficient models, where electricity alone can cost more than a cloud Flash API. Use tools like `HomeLab Monitor` to measure actual per-token energy costs and confirm that your local setup really saves money, especially under high utilization.
Key insights
Local LLM energy costs vary significantly by model size and architecture, with larger models potentially exceeding cloud API costs.
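To put the comparison in numbers, the short sketch below divides each measured cost from the summary by the ~€0.55/M cloud Flash-class figure. The dictionary and function names are illustrative, not from the article, and the ratios cover electricity only, not hardware or maintenance.

```python
# Measured energy cost in EUR per million output tokens, as reported in the summary.
MEASURED_EUR_PER_M = {
    "gemma3:1b": 0.118,
    "gemma4:26b": 0.272,
    "gemma3:27b": 0.706,
}

# Approximate cloud Flash-class reference price quoted in the summary.
CLOUD_FLASH_EUR_PER_M = 0.55


def relative_to_cloud(local_cost: float, cloud_cost: float = CLOUD_FLASH_EUR_PER_M) -> float:
    """Return local cost as a fraction of the cloud reference (above 1.0 means local costs more)."""
    return local_cost / cloud_cost


ratios = {
    model: round(relative_to_cloud(cost), 2)
    for model, cost in MEASURED_EUR_PER_M.items()
}

if __name__ == "__main__":
    for model, ratio in ratios.items():
        verdict = "more expensive than" if ratio > 1 else "cheaper than"
        print(f"{model}: {ratio:.2f}x the cloud reference ({verdict} cloud)")
```

Only `gemma3:27b` lands above 1.0, which matches the summary's point that the largest model costs more in electricity alone than the cloud reference.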
Principles
- Cost per token scales worse than linearly with model size.
- Newer model architectures can improve energy efficiency.
- Marginal energy cost is distinct from total cost of ownership.
Method
Measure GPU power draw via `nvidia-smi` every 10 seconds, integrate over a fixed workload's duration, and multiply by real electricity tariffs to calculate €/M output tokens.
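The method above can be sketched in a few lines of Python. This is a minimal illustration, not the article's or `HomeLab Monitor`'s actual implementation: the function names are hypothetical, the integration uses a simple trapezoidal rule, and the tariff is assumed to be quoted per kWh. The `nvidia-smi` query flags used (`--query-gpu=power.draw --format=csv,noheader,nounits`) are standard, but only the first GPU's reading is taken here.

```python
import subprocess
from typing import Sequence


def read_power_watts() -> float:
    """Read the current power draw (W) of the first GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip().splitlines()[0])


def energy_kwh(power_watts: Sequence[float], interval_s: float = 10.0) -> float:
    """Integrate evenly spaced power samples (W) into energy (kWh), trapezoidal rule."""
    if len(power_watts) < 2:
        raise ValueError("need at least two samples to integrate")
    joules = sum(
        (a + b) / 2 * interval_s
        for a, b in zip(power_watts, power_watts[1:])
    )
    return joules / 3_600_000  # 1 kWh = 3.6e6 J


def eur_per_million_tokens(
    power_watts: Sequence[float],
    output_tokens: int,
    tariff_eur_per_kwh: float,  # assumption: tariff is expressed per kWh
    interval_s: float = 10.0,
) -> float:
    """Energy cost of the workload, scaled to one million output tokens."""
    cost_eur = energy_kwh(power_watts, interval_s) * tariff_eur_per_kwh
    return cost_eur / output_tokens * 1_000_000
```

As a usage example with made-up numbers (not the article's measurements): 25 samples at a constant 300 W, taken every 10 s, span 240 s and amount to 0.02 kWh. At a €0.30 tariff and 10,000 output tokens, `eur_per_million_tokens([300.0] * 25, 10_000, 0.30)` comes to about €0.60/M tokens.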
In practice
- Use `HomeLab Monitor` for real-time GPU power tracking.
- Quantize models (e.g., `Q4_K_M` GGUF) for local inference.
- Implement a warm-up call before timed benchmarks.
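The warm-up step can be sketched as follows: one untimed generation first, so that model loading does not count toward the timed window, then a fixed-duration loop that tallies output tokens. This is a hedged sketch rather than the article's benchmark script. `timed_benchmark` and `ollama_generate` are hypothetical names. The endpoint is ollama's default local `/api/generate` route, and the token count is read from the `eval_count` field of a non-streaming response.

```python
import json
import time
import urllib.request
from typing import Callable, Tuple

OLLAMA_URL = "http://localhost:11434/api/generate"  # ollama's default local endpoint


def ollama_generate(model: str, prompt: str, num_predict: int = 256) -> int:
    """Run one non-streaming generation and return the number of output tokens."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},  # cap on generated tokens
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["eval_count"]


def timed_benchmark(
    generate: Callable[[], int],
    duration_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[int, float]:
    """Warm up once (untimed), then generate repeatedly for about duration_s seconds.

    Returns (total output tokens in the timed window, elapsed seconds).
    """
    generate()  # warm-up call: loads the model, result discarded
    start = clock()
    total_tokens = 0
    while clock() - start < duration_s:
        total_tokens += generate()
    return total_tokens, clock() - start
```

A 4-minute run like the one described in the summary would be `timed_benchmark(lambda: ollama_generate("gemma3:1b", "Write a short story."), 240)`, where the prompt is a placeholder. Passing the generator as a callable keeps the loop testable without a running ollama server.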
Topics
- Local LLM Costs
- GPU Energy Consumption
- ollama Inference
- HomeLab Monitor
- Gemma Models
- Token Generation Cost
Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.