CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

CAPS (Cascaded Adaptive Pairwise Selection) is an inference-only framework that reduces verifier compute in parallel reasoning with large language models. Existing pairwise self-verification methods run many full-solution judgments regardless of how informative each comparison is. CAPS instead allocates verifier compute non-uniformly along two axes: an evidence axis, which adapts how much of each candidate solution the judge sees, and a distribution axis, which adapts how comparisons are spread across the candidate pool. The framework implements a four-stage cascade with an optional rescue subroutine and has a closed-form verifier-token cost that roughly halves the per-candidate marginal cost relative to uniform full-evidence schedules. Evaluated on four self-verifying models (Qwen3-14B, GPT-OSS-20B, Qwen3-4B-Instruct/Thinking) and five reasoning benchmarks (LiveCodeBench-v5/v6, CodeContests, AIME 2025, HMMT 2025), for 20 model-benchmark suites in total, CAPS outperformed the leading pairwise verifier on 14 of the 20 suites while using only 25.4% of its verifier-token budget on code, and surpassed pointwise self-verification on all 20.

Key takeaway

For NLP engineers optimizing large language model inference costs, CAPS offers a way to cut verifier-token budgets substantially while matching or improving selection quality in most reported settings. On code benchmarks it used 25.4% of the leading pairwise verifier's verifier-token budget, a 74.6% reduction, and it outperformed that verifier on 14 of 20 suites overall. Consider integrating this cascaded adaptive selection framework for reasoning tasks that already rely on self-verification over multiple candidate solutions.
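The relationship between the two reported figures (25.4% of the budget used, 74.6% reduction) is simple arithmetic, sketched below. The function name and the baseline of 1,000,000 verifier tokens are hypothetical, chosen only for illustration; they do not come from the paper.

```python
def reduction_percent(fraction_used: float) -> float:
    """Percentage reduction implied by using `fraction_used` of a baseline budget."""
    if not 0.0 <= fraction_used <= 1.0:
        raise ValueError("fraction_used must be between 0 and 1")
    return (1.0 - fraction_used) * 100.0


# Reported figure: CAPS uses 25.4% of the pairwise baseline's verifier tokens on code.
fraction_used = 0.254

# Hypothetical baseline budget, for illustration only.
baseline_tokens = 1_000_000
caps_tokens = round(baseline_tokens * fraction_used)

print(round(reduction_percent(fraction_used), 1))  # 74.6
print(caps_tokens)                                 # 254000
```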

Key insights

CAPS optimizes LLM parallel reasoning by adaptively reducing verifier compute through cascaded pairwise selection.

Method

CAPS employs a four-stage cascade with an optional rescue subroutine, adapting verifier compute along evidence and distribution axes to reduce token cost in parallel reasoning.
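The evidence-axis idea can be illustrated with a minimal sketch: show the judge a truncated view of each candidate first, and escalate to the full solutions only when the judge is not confident. This is not the paper's four-stage cascade or its rescue subroutine. The judge signature, the confidence threshold, the prefix length, the character-count cost proxy, and the use of plain single elimination as a stand-in for the distribution axis are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical judge: (view_a, view_b) -> (a_wins, confidence in [0, 1]).
Judge = Callable[[str, str], Tuple[bool, float]]


@dataclass
class Stats:
    verifier_tokens: int = 0  # crude proxy: characters shown to the judge


def compare(a: str, b: str, judge: Judge, prefix_len: int,
            threshold: float, stats: Stats) -> str:
    """Return the preferred candidate, escalating evidence only when needed."""
    # Cheap pass: the judge sees only a prefix of each solution.
    short_a, short_b = a[:prefix_len], b[:prefix_len]
    stats.verifier_tokens += len(short_a) + len(short_b)
    a_wins, confidence = judge(short_a, short_b)
    if confidence < threshold:
        # Low confidence: pay for a full-evidence comparison.
        stats.verifier_tokens += len(a) + len(b)
        a_wins, _ = judge(a, b)
    return a if a_wins else b


def select(candidates: List[str], judge: Judge, prefix_len: int = 64,
           threshold: float = 0.7) -> Tuple[str, Stats]:
    """Single-elimination selection over the candidate pool."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    stats = Stats()
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            next_round.append(
                compare(pool[i], pool[i + 1], judge, prefix_len, threshold, stats)
            )
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # odd candidate out gets a bye
        pool = next_round
    return pool[0], stats


def toy_judge(a: str, b: str) -> Tuple[bool, float]:
    """Toy stand-in for an LLM judge: prefers the longer view, unsure on ties."""
    return len(a) >= len(b), (1.0 if len(a) != len(b) else 0.0)


winner, stats = select(["aa", "bbbb", "c", "ddddddd"], toy_judge, prefix_len=3)
print(winner, stats.verifier_tokens)
```

Whether such a scheme saves tokens depends on how often the cheap pass is confident enough to skip escalation; a real judge would be an LLM call whose cost is measured in tokens, not characters.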

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.