Running 26B and 35B LLMs at Full Speed on €990 of Used Hardware — No Cloud Required

2026-06-10 · Source: AI Advances - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, long

Summary

The article shows how a €990 secondhand PC with an RTX 4070 (12 GB) and an RTX 2070 SUPER (8 GB), 20 GB of VRAM combined, runs 26B and 35B LLMs locally at interactive speeds. Using "llama.cpp" on antiX Linux, the build reaches 82.6 tokens/second (tok/s) on Gemma 4 26B and 73 tok/s on Qwen3.6 35B-A3B. That matches the performance of a single RTX 3090, a card that alone costs €800–1,000 used.

The author recommends "llama.cpp" over Ollama for its fine-grained control and reports a 35–38% throughput gain. Speculative decoding via Multi-Token Prediction (MTP), commonly claimed to speed up generation, instead hurt performance on this dual-GPU PCIe topology: a 41% throughput loss for Gemma 4 26B and a 9% slowdown for Qwen3.6 35B.

The analysis also finds Mixture-of-Experts (MoE) models surprisingly power-efficient: the 35B MoE drew less GPU power (108 W) than a 12B dense model (204 W), which works out to approximately €0.22 per million tokens. The author presents this as evidence of how far local inference capabilities have shifted by mid-2026.
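
The per-token electricity cost follows from power draw, throughput, and tariff. The sketch below is a rough illustration, not the author's calculation: it assumes the 108 W GPU figure was measured at the 73 tok/s rate, and the €0.30/kWh tariff is an assumed value that this summary does not state.

```python
def energy_cost_per_million_tokens(power_watts: float,
                                   tok_per_s: float,
                                   eur_per_kwh: float) -> float:
    """Electricity cost in euros to generate one million tokens."""
    seconds = 1_000_000 / tok_per_s               # time to emit 1M tokens
    kwh = (power_watts / 1000) * (seconds / 3600)  # energy used in that time
    return kwh * eur_per_kwh


# Power and throughput come from the summary; the tariff is an assumption.
ASSUMED_EUR_PER_KWH = 0.30
gpu_only = energy_cost_per_million_tokens(108, 73, ASSUMED_EUR_PER_KWH)
print(f"GPU-only electricity: EUR {gpu_only:.2f} per million tokens")
```

Under these assumptions the GPU-only cost comes out below the article's €0.22. The article's figure presumably reflects whole-system draw or a different tariff, neither of which is given here.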

Key takeaway

For AI engineers evaluating cost-effective local inference, this analysis shows that a €990 dual-GPU setup (RTX 4070 + RTX 2070 SUPER) can deliver 73–83 tok/s on 26B–35B LLMs, matching a single RTX 3090. Prioritize "llama.cpp" for its control surface, and benchmark optimizations such as speculative decoding on your own hardware before adopting them, since speculative decoding proved detrimental on this PCIe-limited system. Consider MoE models for their power efficiency; running them locally gives real-time, private inference for cents per million tokens.

Key insights

Cost-effective local LLM inference is achievable with two secondhand GPUs: the pair matches a single RTX 3090, and the benchmarks show speculative decoding slowing generation on a dual-GPU PCIe topology.

Principles

Dual-GPU setups can combine VRAM for high-performance inference.
Fine-grained control over LLM inference software is crucial for optimization.
Sustained generation benchmarks reveal true performance, unlike short bursts.

Method

Assemble a dual-GPU system with combined VRAM. Utilize "llama.cpp" for precise control over layer distribution and expert offloading. Measure performance on sustained 2000-token generations. Isolate variables when benchmarking.
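
The measurement step can be sketched as a small comparison helper: time several sustained 2000-token generations per configuration, change only one option between configurations, and compare medians. The timings below are illustrative placeholders, not measurements from the article, and the function names are hypothetical.

```python
from statistics import median


def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Sustained throughput of one generation run."""
    return n_tokens / elapsed_s


def relative_change(baseline_tps: float, variant_tps: float) -> float:
    """Percent change of the variant versus the baseline (negative = slower)."""
    return (variant_tps - baseline_tps) / baseline_tps * 100


# Illustrative wall-clock seconds for three 2000-token generations each.
baseline_runs = [25.0, 24.0, 26.0]
variant_runs = [40.0, 41.0, 39.0]  # same setup with exactly one option changed

baseline = median(tokens_per_second(2000, s) for s in baseline_runs)
variant = median(tokens_per_second(2000, s) for s in variant_runs)
print(f"baseline {baseline:.1f} tok/s, variant {variant:.1f} tok/s, "
      f"change {relative_change(baseline, variant):+.1f}%")
```

Taking the median over repeated long runs is one way to keep a single noisy run or a short warm-up burst from skewing the comparison.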

In practice

Pair an RTX 4070 (12 GB) with an RTX 2070 SUPER (8 GB) for 20 GB VRAM.
Use "llama.cpp" with "--tensor-split 0.62,0.38" for Gemma 4 26B.
For Qwen3.6 35B, use IQ4_XS quantization and "-ncmoe" with a 70/30 split.

Topics

Local LLM Inference
Dual-GPU Hardware
llama.cpp Optimization
MoE Model Efficiency
Speculative Decoding
GPU Benchmarking

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Advances - Medium.