How to Reduce LLM Inference Cost and Improve Accuracy with Pass@k and Majority Voting
Summary
Large Language Models (LLMs) sample at inference time: rather than always choosing the single most likely next token, they draw from a distribution of plausible tokens, so the same question can produce different answers. This randomness, controlled by hyperparameters such as temperature and top-p, can be leveraged to improve accuracy on tasks with a single correct answer. Both pass@k and maj@k run the model k times on the same prompt. Pass@k counts a problem as solved if at least one of the k outputs is correct, which requires a way to check outputs; maj@k returns the most common answer among the k outputs. These methods are particularly effective when the model's "thinking" process is disabled, because the much lower cost per sample leaves room for more samples. For instance, Qwen3.5 27B (NVFP4) with thinking disabled can achieve higher pass@k scores at a lower overall cost than with thinking enabled, as demonstrated on benchmarks like LiveCodeBench and AIME.
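To make the pass@k score concrete, here is a minimal sketch of the commonly used unbiased pass@k estimator (introduced with the HumanEval benchmark). Whether the original article computes pass@k this way is an assumption; the function name `pass_at_k` and its arguments are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled outputs, c of which were correct.

    Assumption: this is the standard unbiased estimator,
    1 - C(n - c, k) / C(n, k), not necessarily the article's exact method.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # C(n - c, k) / C(n, k) is the probability that a random subset of
    # k samples (out of the n drawn) contains no correct output.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 samples of which 1 is correct, `pass_at_k(4, 1, 2)` estimates the chance that a random pair of samples contains the correct one.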
Key takeaway
For AI engineers optimizing LLM deployment, consider implementing pass@k or maj@k strategies, particularly with models where "thinking" can be disabled. This approach can significantly reduce inference costs while often outperforming the single-run accuracy of more expensive, "thinking-enabled" configurations. Evaluate the cost-benefit trade-off for your specific task, as the gains vary across benchmarks such as LiveCodeBench and MMLU-Pro.
Key insights
Random sampling at inference can be leveraged for accuracy gains via multiple attempts, especially with cost-optimized models.
Principles
- Sampling randomness lets repeated runs explore different answers to the same prompt.
- Disabling "thinking" reduces inference cost significantly.
- Cost savings can fund multiple samples for accuracy.
Method
Run the LLM multiple times with random sampling and score or select with pass@k or maj@k to improve accuracy. Disable the costly "thinking" process and reinvest the savings in more samples.
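The reinvestment step is simple arithmetic. The sketch below assumes cost is proportional to generated tokens and uses a hypothetical helper name, `samples_within_budget`; the token counts in the usage note are made-up placeholders, not figures from the article.

```python
def samples_within_budget(thinking_tokens: int, no_thinking_tokens: int) -> int:
    """Number of non-thinking samples that cost no more than one thinking run.

    Assumption: cost scales linearly with generated tokens, so the budget
    of a single thinking-enabled run is `thinking_tokens`.
    """
    if thinking_tokens <= 0 or no_thinking_tokens <= 0:
        raise ValueError("token counts must be positive")
    # Floor division: only whole samples can be generated.
    return thinking_tokens // no_thinking_tokens
```

With hypothetical averages of 8,000 tokens per thinking run and 500 tokens per non-thinking run, `samples_within_budget(8000, 500)` gives 16, so k can be as large as 16 before the non-thinking setup costs more. Real averages should be measured on your own task.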
In practice
- Use pass@k for verifiable tasks like code generation.
- Apply maj@k to tasks that have no automatic checker at inference time but a single final answer, like math or multiple-choice questions (MCQ).
- Experiment with disabling "thinking" for cost-effective accuracy.
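The two selection rules above can be sketched as follows. This is a minimal illustration, not the article's implementation: `majority_vote`, `first_passing`, and the `verifier` callback are hypothetical names, and the answers are assumed to be already extracted from the model outputs as short strings.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def majority_vote(answers: Iterable[str]) -> Optional[str]:
    """maj@k selection: return the most common answer, or None if there is none.

    Answers are stripped and lowercased so trivial formatting differences
    do not split the vote; ties go to the answer seen first.
    """
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        return None
    return Counter(normalized).most_common(1)[0][0]


def first_passing(candidates: Iterable[str],
                  verifier: Callable[[str], bool]) -> Optional[str]:
    """pass@k-style selection: return the first candidate the verifier accepts.

    `verifier` is an assumption: for code generation it could run unit
    tests against the candidate and return True when they all pass.
    """
    for candidate in candidates:
        if verifier(candidate):
            return candidate
    return None
```

In use, `majority_vote` would receive the k final answers extracted from k sampled completions, while `first_passing` would receive the k generated programs together with a test-running callback.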
Topics
- LLM Inference Optimization
- Pass@k Sampling
- Majority Voting (Maj@k)
- Disabled Reasoning
- Qwen3.5 27B
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.