How Much Does It Actually Cost to Run a Local LLM? (€ per Million Tokens, Measured)

Source: LLM on Medium · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Data Science & Analytics) · Depth: Intermediate, medium

Summary

The article measures the actual energy cost of running local LLMs on a single RTX 3090 (24 GB) machine named "ardi", using `ollama` and `HomeLab Monitor`. It compares three `Q4_K_M`-quantized GGUF Gemma models: `gemma3:1b` (1B parameters), `gemma4:26b` (25.8B parameters), and `gemma3:27b` (27B parameters). Cost is expressed in euros per million output tokens, based on real GPU energy consumption sampled from `nvidia-smi` and an electricity tariff of €0.30/kWh by day and €0.18/kWh at night. Each model ran a 4-minute loop of 256-token generations. The results: `gemma3:1b` cost €0.118/M tokens, `gemma4:26b` €0.272/M tokens, and `gemma3:27b` €0.706/M tokens. The largest model was therefore more expensive in electricity alone than cloud Flash-class APIs (~€0.55/M tokens), while the two smaller models were significantly cheaper. The study emphasizes that cost per token does not follow parameter count alone and that architecture plays a crucial role in efficiency: the 25.8B `gemma4:26b` costs roughly 2.6 times less per token than the 27B `gemma3:27b`.
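To put the reported figures side by side with the ~€0.55/M Flash-class price, a few lines of Python are enough. The numbers are copied from the summary above; the dictionary and function names are illustrative only, and the comparison covers electricity cost alone, exactly as the measurements do.

```python
# Electricity-only cost per million output tokens, as reported for the RTX 3090 runs.
LOCAL_EUR_PER_M = {
    "gemma3:1b": 0.118,
    "gemma4:26b": 0.272,
    "gemma3:27b": 0.706,
}
CLOUD_FLASH_EUR_PER_M = 0.55  # approximate Flash-class API price quoted in the article


def ratio_to_cloud(local: dict, cloud_price: float) -> dict:
    """Local cost expressed as a multiple of the cloud price (< 1 means local is cheaper)."""
    return {model: cost / cloud_price for model, cost in local.items()}


for model, ratio in ratio_to_cloud(LOCAL_EUR_PER_M, CLOUD_FLASH_EUR_PER_M).items():
    verdict = "cheaper" if ratio < 1 else "more expensive"
    print(f"{model}: {ratio:.2f}x the cloud price ({verdict})")
# gemma3:1b: 0.21x the cloud price (cheaper)
# gemma4:26b: 0.49x the cloud price (cheaper)
# gemma3:27b: 1.28x the cloud price (more expensive)
```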

Key takeaway

For AI Engineers evaluating local LLM deployments: prioritize the smallest model that meets your quality requirements. The assumption that local inference is free, or inherently cheaper than cloud APIs, often fails for larger, less efficient models, where electricity alone can cost more than a cloud Flash-class API. Use tools like `HomeLab Monitor` to measure actual per-token energy costs and confirm that your local setup really delivers savings, especially under high utilization.

Key insights

Local LLM energy costs vary significantly by model size and architecture, with larger models potentially exceeding cloud API costs.

Principles

Method

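Sample GPU power draw via `nvidia-smi` every 10 seconds while a fixed workload runs, integrate the samples over the workload's duration to get energy in kWh, multiply by the applicable electricity tariff, then divide by the number of output tokens generated and scale to one million to obtain €/M output tokens.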
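The method above can be sketched in Python. This is a minimal sketch under stated assumptions, not the article's or `HomeLab Monitor`'s code: all function names are hypothetical, the trapezoidal rule is simply one reasonable way to integrate the 10-second samples, and the only real external call is the `nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits` query. Like the article's approach, it counts only the GPU power that `nvidia-smi` reports.

```python
import subprocess
import time
from typing import List, Tuple

# Tariffs quoted in the article (EUR per kWh); which one applies depends on when the run happens.
DAY_TARIFF_EUR_PER_KWH = 0.30
NIGHT_TARIFF_EUR_PER_KWH = 0.18


def read_gpu_power_watts() -> float:
    """Return the current power draw (W) of the first GPU reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip().splitlines()[0])


def sample_power(duration_s: float, interval_s: float = 10.0) -> List[Tuple[float, float]]:
    """Collect (elapsed_seconds, watts) samples every interval_s until duration_s has passed.

    Hypothetical helper: run it in parallel with the generation workload (e.g. a background thread).
    """
    samples: List[Tuple[float, float]] = []
    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        samples.append((elapsed, read_gpu_power_watts()))
        if elapsed >= duration_s:
            return samples
        time.sleep(interval_s)


def energy_kwh(samples: List[Tuple[float, float]]) -> float:
    """Integrate power over time with the trapezoidal rule; 1 kWh = 3.6e6 J."""
    joules = sum(
        (p0 + p1) / 2 * (t1 - t0)
        for (t0, p0), (t1, p1) in zip(samples, samples[1:])
    )
    return joules / 3.6e6


def eur_per_million_tokens(kwh: float, output_tokens: int, tariff_eur_per_kwh: float) -> float:
    """Electricity cost of the run, normalised to one million output tokens."""
    return kwh * tariff_eur_per_kwh / output_tokens * 1_000_000


# Synthetic example (made-up numbers, NOT the article's data):
# a constant 300 W over 240 s, sampled every 10 s, producing 12,000 output tokens.
demo = [(float(t), 300.0) for t in range(0, 241, 10)]
print(round(eur_per_million_tokens(energy_kwh(demo), 12_000, DAY_TARIFF_EUR_PER_KWH), 3))  # 0.5
```

Recording the actual elapsed time of each sample, rather than assuming exactly 10 s between readings, keeps the integral honest when the `nvidia-smi` call itself takes a noticeable fraction of a second.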

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.