How to Reduce LLM Inference Cost and Improve Accuracy with Pass@k and Majority Voting

· Source: The Kaitchup – AI on a Budget · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Intermediate, medium

Summary

Large language models (LLMs) sample at inference time: instead of always choosing the single most likely next token, they draw from a distribution of plausible tokens, so the same question can yield different answers. This randomness, controlled by hyperparameters such as temperature and top-p, can be exploited to improve accuracy on tasks with a single correct answer. Both pass@k and maj@k run the model k times on the same prompt. pass@k counts a problem as solved if at least one of the k answers is correct, which presupposes a way to check answers (for example, unit tests for code); maj@k returns the most common answer and needs no checker. These methods are particularly effective when the model's "thinking" process is disabled: each run becomes much cheaper, which leaves budget for more samples. For instance, Qwen3.5 27B (NVFP4) with thinking disabled can achieve higher pass@k scores at a lower overall cost than with thinking enabled, as demonstrated on benchmarks like LiveCodeBench and AIME.
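
To make the sampling step concrete, here is a minimal sketch of how temperature and top-p can turn raw next-token scores into a random choice. It is a toy illustration in plain Python, not the implementation of any particular inference engine; the function name `sample_token` and the example logits are assumptions made for this sketch.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick a token index from raw logits (hypothetical helper for illustration).

    temperature rescales the logits before the softmax; top_p keeps only the
    smallest set of most likely tokens whose cumulative probability reaches top_p.
    """
    if temperature <= 0:
        # Treat non-positive temperature as greedy decoding: always the top token.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Softmax over temperature-scaled logits (shifted by the max for stability).
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus (top-p) filtering: walk tokens from most to least likely and stop
    # once the kept tokens cover at least top_p of the probability mass.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Sample among the kept tokens, weighted by their probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# Toy logits for a three-token vocabulary (made-up numbers).
toy_logits = [1.0, 3.0, 2.0]
draws = [sample_token(toy_logits, temperature=1.0, top_p=0.9) for _ in range(10)]
print(draws)  # varies from run to run; that variation is what pass@k and maj@k exploit
```

With `temperature=0` the sketch always returns the highest-scoring token, which is why repeated runs only help when sampling is left on.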

Key takeaway

AI engineers optimizing LLM deployment should consider pass@k or maj@k strategies, particularly with models whose "thinking" can be disabled. This approach can significantly reduce inference cost while often beating the single-run accuracy of more expensive, thinking-enabled configurations. Evaluate the cost-benefit trade-off for your specific task, since the gains differ between benchmarks such as LiveCodeBench and MMLU-Pro.
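
As a starting point, the two strategies can be sketched in a few lines of Python. The summary does not say how the original article computed its scores, so this sketch makes two assumptions: pass@k is estimated with the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k) over n samples of which c are correct, and maj@k compares answers as whitespace-stripped strings. The function names are hypothetical.

```python
from collections import Counter
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c were judged correct.

    Uses the common unbiased estimator 1 - C(n - c, k) / C(n, k): the chance
    that a random subset of k samples contains at least one correct answer.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb() returns 0 when k > n - c, so the estimate is exactly 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_vote(answers):
    """Return the most frequent answer (maj@k) after trimming whitespace.

    Ties go to the answer that appeared first, because Counter.most_common
    keeps first-seen order among equal counts.
    """
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [answer.strip() for answer in answers]
    return Counter(normalized).most_common(1)[0][0]


# Made-up example: 10 samples for one problem, 3 of them correct.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
print(majority_vote(["42", "41", "42 ", "7"]))
```

Note the practical difference: `pass_at_k` needs a correctness judgment for each sample (a test suite or reference answer), while `majority_vote` only needs the answers to be comparable, which is why it suits tasks with a single short final answer.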

Key insights

Random sampling at inference can be leveraged for accuracy gains via multiple attempts, especially with cost-optimized models.

Principles

Method

Sample the LLM several times per prompt and aggregate the results (pass@k or maj@k) to improve accuracy. Disabling the costly "thinking" process makes each run cheaper, and the savings can be reinvested in more samples.
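
The "reinvest the savings" idea comes down to simple arithmetic, sketched below. Every number here is a hypothetical placeholder rather than a measurement from the original article, and the success formula assumes that samples are independent with the same per-sample accuracy, which real model outputs only approximate.

```python
def samples_within_budget(budget_tokens, tokens_per_sample):
    """How many full samples fit into a fixed output-token budget."""
    if tokens_per_sample <= 0:
        raise ValueError("tokens_per_sample must be positive")
    return budget_tokens // tokens_per_sample


def chance_any_correct(p_single, k):
    """Probability that at least one of k independent samples is correct.

    Assumes every sample succeeds independently with probability p_single.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    return 1.0 - (1.0 - p_single) ** k


# Hypothetical figures, chosen only to illustrate the trade-off.
thinking_tokens = 8000      # assumed output tokens for one thinking-enabled run
no_thinking_tokens = 500    # assumed output tokens for one thinking-disabled run
p_no_thinking = 0.30        # assumed single-run accuracy with thinking disabled

# Spend the cost of ONE thinking run on thinking-disabled samples instead.
k = samples_within_budget(thinking_tokens, no_thinking_tokens)
print(f"samples affordable for the same token budget: {k}")
print(f"chance at least one is correct: {chance_any_correct(p_no_thinking, k):.3f}")
```

Plugging in measured token counts and accuracies for your own model and task shows whether many cheap samples actually beat one expensive run, which is the comparison the method depends on.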

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.