Running 26B and 35B LLMs at Full Speed on €990 of Used Hardware — No Cloud Required
Summary
The article details how a €990 secondhand PC with an RTX 4070 (12 GB) and an RTX 2070 SUPER (8 GB), 20 GB of VRAM combined, runs 26B and 35B LLMs locally at high speed. Using "llama.cpp" on antiX Linux, the setup reaches 82.6 tokens/second (tok/s) on Gemma 4 26B and 73 tok/s on Qwen3.6 35B-A3B, matching a single RTX 3090, which itself costs €800–1,000 used. The author favors "llama.cpp" over Ollama for its fine-grained control and reports a 35–38% throughput gain. Contrary to common claims, speculative decoding via Multi-Token Prediction (MTP) hurt throughput on this dual-GPU PCIe topology: a 41% loss for Gemma 4 26B and a 9% slowdown for Qwen3.6 35B. The analysis also highlights the power efficiency of Mixture-of-Experts (MoE) models: the 35B MoE draws less GPU power (108 W) than a 12B dense model (204 W), for a cost of approximately €0.22 per million tokens. According to the article, this marks a significant shift in local inference capabilities by mid-2026.
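The cost-per-million-tokens figure follows from three inputs: throughput, power draw, and electricity price. The sketch below shows that arithmetic under stated assumptions. The summary gives the 108 W GPU draw and 73 tok/s, but not the whole-system wall power or the electricity tariff, so the `eur_per_kwh` value used here is a placeholder and the function names are illustrative, not taken from the article.

```python
def cost_per_million_tokens(power_watts: float,
                            tokens_per_second: float,
                            eur_per_kwh: float) -> float:
    """Electricity cost in EUR to generate one million tokens."""
    seconds = 1_000_000 / tokens_per_second      # time to emit 1M tokens
    kwh = (power_watts / 1000) * (seconds / 3600)  # energy used in that time
    return kwh * eur_per_kwh


def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy per generated token; watts are joules per second."""
    return power_watts / tokens_per_second


if __name__ == "__main__":
    # 108 W and 73 tok/s come from the summary (GPU power only).
    # 0.30 EUR/kWh is an ASSUMED tariff, not a figure from the article.
    print(cost_per_million_tokens(108, 73, eur_per_kwh=0.30))
    print(joules_per_token(108, 73))
```

Because the 108 W figure covers the GPUs only, plugging it in understates the cost of the whole machine; substitute measured wall power and a local tariff to reproduce a figure comparable to the article's.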
Key takeaway
For AI engineers evaluating cost-effective local inference, this analysis shows that a €990 dual-GPU setup (RTX 4070 + RTX 2070 SUPER) can deliver 73–83 tok/s on 26B–35B LLMs, matching single RTX 3090 performance. Prioritize "llama.cpp" for its control surface, and benchmark optimizations such as speculative decoding before adopting them, since MTP proved detrimental on this PCIe-limited system. Consider MoE models for their power efficiency: combined with local hosting, they enable real-time, private inference for cents per million tokens.
Key insights
Cost-effective local LLM inference is achievable with dual secondhand GPUs, which match a single used RTX 3090, and the benchmarks expose speculative decoding's limitations on a PCIe-linked dual-GPU setup.
Principles
- Dual-GPU setups can combine VRAM for high-performance inference.
- Fine-grained control over LLM inference software is crucial for optimization.
- Sustained generation benchmarks reveal true performance, unlike short bursts.
Method
Assemble a dual-GPU system with combined VRAM. Utilize "llama.cpp" for precise control over layer distribution and expert offloading. Measure performance on sustained 2000-token generations. Isolate variables when benchmarking.
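A sustained-generation measurement of the kind described above can be sketched as a small timing harness. This is a minimal illustration, not the author's benchmark script: `generate` is a hypothetical callable that blocks until the model has produced its output and returns the number of tokens actually generated (in practice it might wrap a request to a running inference server).

```python
import statistics
import time
from typing import Callable


def sustained_tok_per_s(generate: Callable[[int], int],
                        n_tokens: int = 2000,
                        runs: int = 3) -> float:
    """Median tokens/second over several long generations.

    `generate(n)` is a HYPOTHETICAL callable: it should request n tokens,
    block until generation ends, and return how many tokens were produced.
    """
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        produced = generate(n_tokens)
        elapsed = time.perf_counter() - start
        rates.append(produced / elapsed)
    # The median damps one-off stalls better than the mean.
    return statistics.median(rates)
```

To isolate variables, change one setting at a time (for example, speculative decoding on versus off) and rerun the harness with the same prompt, token count, and number of runs.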
In practice
- Pair an RTX 4070 (12 GB) with an RTX 2070 SUPER (8 GB) for 20 GB VRAM.
- Use "llama.cpp" with "--tensor-split 0.62,0.38" for Gemma 4 26B.
- For Qwen3.6 35B, use IQ4_XS quantization and "-ncmoe" with a 70/30 split.
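The settings above can be assembled into a "llama.cpp" server command line. The sketch below builds the argument list in Python so the per-model differences are explicit. Several parts are assumptions: the `llama-server` binary name on the PATH, the model file path, and `-ngl 99` (offload all layers to the GPUs) do not appear in the summary. The summary also does not state the `-ncmoe` layer count, so it is left as a parameter the caller must supply.

```python
import shlex
from typing import List, Optional


def llama_server_args(model_path: str,
                      tensor_split: str,
                      n_cpu_moe: Optional[int] = None) -> List[str]:
    """Build an argv list for llama-server with a per-GPU tensor split."""
    args = [
        "llama-server",
        "-m", model_path,
        "-ngl", "99",                    # ASSUMED: offload every layer to GPU
        "--tensor-split", tensor_split,  # share of the model per GPU
    ]
    if n_cpu_moe is not None:
        # Keep the MoE expert weights of the first N layers on the CPU.
        args += ["-ncmoe", str(n_cpu_moe)]
    return args


if __name__ == "__main__":
    # The model path is a HYPOTHETICAL placeholder.
    gemma = llama_server_args("models/gemma-4-26b.gguf", "0.62,0.38")
    print(shlex.join(gemma))
```

For the Qwen3.6 35B case, one reading of the "70/30 split" is `tensor_split="0.70,0.30"` together with an `n_cpu_moe` value tuned until the model fits in 20 GB; both should be checked against the original article.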
Topics
- Local LLM Inference
- Dual-GPU Hardware
- llama.cpp Optimization
- MoE Model Efficiency
- Speculative Decoding
- GPU Benchmarking
Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Advances - Medium.