MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling
Summary
MARS (Margin-Adversarial Risk-controlled Stopping) is an early-stopping rule for parallel LLM test-time scaling, which typically samples many reasoning traces and majority-votes their answers. Running every trace to completion is expensive, so MARS probes partial traces at intermediate checkpoints, estimates how likely each active trace is to change its answer, and stops generation once the leading answer's margin is robust to the vote shifts that could still occur. It splits the uncertainty into two parts: trace-level switch probabilities, learned by a five-feature logistic model, and adversarial bounds on where switching traces land, calibrated from warmup traces. Across three reasoning models and three competition-math benchmarks, MARS saves 25–47% of self-consistency tokens and an additional 14–29% on DeepConf Online. Accuracy stays within 0.6 percentage points of the full-budget baseline and improves by up to 0.8 points in some cases.
Key takeaway
MLOps engineers deploying parallel LLM reasoning systems should consider integrating MARS to reduce inference cost and latency. The reported savings are 25–47% of tokens on self-consistency and an additional 14–29% on DeepConf Online, with accuracy within 0.6 percentage points of the full-budget baseline. To evaluate MARS, run a small set of warmup traces to calibrate its switch-probability model and adversarial bounds before relying on it for early stopping.
Key insights
MARS uses margin-adversarial risk control to safely stop parallel LLM inference early, preserving accuracy while reducing token usage.
Principles
- Safety requires margin certification, not just consensus stability.
- Separate "whether" a trace switches from "where" it lands.
- Calibrate adversarial bounds from warmup traces.
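The third principle can be illustrated with a small Python sketch. The summary does not say how the adversarial bound is calibrated, so everything here is an assumption: the function name `calibrate_gamma`, the record format, and the choice of an upper quantile of the ratio between the margin loss actually observed in a warmup run and the worst-case loss the bound allowed.

```python
import math

def calibrate_gamma(records, quantile=0.95):
    """Pick a contraction factor in [0, 1] from warmup observations.

    records: iterable of (realized_loss, adversarial_bound) pairs, one per
    warmup checkpoint. Both quantities are hypothetical bookkeeping values:
    how much the leader's margin actually shrank after the checkpoint, and
    how much it could have shrunk in the worst case.
    """
    ratios = []
    for realized, bound in records:
        if bound > 0:
            # Clip to [0, 1]: a margin that grew counts as zero loss.
            ratios.append(min(1.0, max(0.0, realized / bound)))
    if not ratios:
        return 1.0  # no evidence: fall back to the uncontracted worst case
    ratios.sort()
    # Index of the empirical upper quantile, kept inside the list bounds.
    idx = max(0, min(len(ratios) - 1, math.ceil(quantile * len(ratios)) - 1))
    return ratios[idx]

# Synthetic warmup records, for illustration only.
warmup = [(0, 4), (1, 4), (2, 4), (1, 2)]
gamma = calibrate_gamma(warmup, quantile=1.0)
```

A high quantile keeps the factor conservative: it only contracts the worst case as far as the warmup runs support.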
Method
At each checkpoint, MARS probes active traces for their current answers, estimates each trace's switch probability with a five-feature logistic model, and computes the maximum adversarial margin loss. It stops generation when the leader's margin exceeds this loss plus a concentration correction.
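The stopping test described above might look roughly like the following Python sketch. It is not the paper's formula. The per-trace loss weights (2 for a trace leaving the leader, 1 for any other trace moving to the challenger), the way `gamma` scales the expected loss, the Hoeffding-style correction with confidence parameter `delta`, and the independence of switches that correction relies on are all assumptions made for illustration.

```python
import math
from collections import Counter

_UNSEEN = object()  # stands for an answer no trace has produced yet

def should_stop(finished, active, switch_probs, gamma=1.0, delta=0.05):
    """Return True if the leader's margin survives an adversarial check.

    finished:     final answers of completed traces
    active:       currently probed answers of still-running traces
    switch_probs: estimated switch probability for each active trace
    gamma, delta: hypothetical contraction and confidence parameters
    """
    votes = Counter(finished) + Counter(active)
    if not votes:
        return False
    ranked = votes.most_common()
    leader, lead_count = ranked[0]
    challengers = [a for a, _ in ranked[1:]] + [_UNSEEN]
    log_term = 0.5 * math.log(1.0 / delta)
    for c in challengers:
        margin = lead_count - votes.get(c, 0)
        # Worst case: every switching trace lands on challenger c.
        weights = [0 if a == c else (2 if a == leader else 1) for a in active]
        expected = gamma * sum(p * w for p, w in zip(switch_probs, weights))
        # Hoeffding-style slack for a sum of bounded independent losses.
        correction = math.sqrt(log_term * sum(w * w for w in weights))
        if margin <= expected + correction:
            return False
    return True

# A wide margin with few, stable active traces passes the check.
safe = should_stop(["42"] * 10, ["42", "42", "17"], [0.05, 0.05, 0.5])
# A thin margin with volatile active traces does not.
risky = should_stop(["42", "17"], ["42", "17", "42", "9"], [0.5] * 4)
```

Checking every challenger, including an answer not yet seen, is a conservative choice: the leader has to be safe against each of them, not only the current runner-up.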
In practice
- Implement intermediate probing for parallel LLM traces.
- Train a lightweight logistic model on trace history for switch prediction.
- Calibrate a contraction parameter (gamma) from warmup runs.
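As a rough illustration of the second step, the sketch below fits a small logistic model with plain gradient descent using only the Python standard library. The summary says the model has five features but does not name them, so the feature vectors here are unnamed placeholders, and the training data is synthetic and generated only to make the example run. The function names and hyperparameters are likewise assumptions.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=500, l2=1e-3):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * (g / n + l2 * wj) for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict_switch(w, b, x):
    """Estimated probability that a trace with features x changes its answer."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Synthetic history: five placeholder features per (trace, checkpoint) row.
# The label rule (switch when feature 0 is high) is made up for this demo.
rng = random.Random(0)
X = [[rng.random() for _ in range(5)] for _ in range(200)]
y = [1 if x[0] > 0.5 else 0 for x in X]
w, b = fit_logistic(X, y)

p_high = predict_switch(w, b, [0.9, 0.5, 0.5, 0.5, 0.5])
p_low = predict_switch(w, b, [0.1, 0.5, 0.5, 0.5, 0.5])
```

In a real deployment the rows would come from warmup traces, with the label recording whether the trace's answer at a checkpoint differed from its final answer.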
Topics
- LLM Inference Optimization
- Early Stopping
- Parallel Decoding
- Self-Consistency
- Margin-Adversarial Control
- Computational Efficiency
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.