MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling
Summary
MARS (Margin-Adversarial Risk-controlled Stopping) reduces the computational overhead of parallel LLM test-time scaling, which typically runs many reasoning traces to completion and majority-votes their answers. The method builds on the observation that partial traces can be probed at intermediate checkpoints to extract evolving aggregate votes without disrupting generation. Its margin-adversarial stopping rule estimates how likely each active trace is to change its answer, and it stops once the leading answer is deemed stable under a conservative bound on future vote movement. Uncertainty is separated into trace-level switch probabilities and an adversarial bound from warmup traces. A five-feature logistic model predicts switching behavior. Across three reasoning models and three competition-math benchmarks, MARS uses 25-47% fewer tokens than self-consistency and 14-29% fewer than DeepConf Online while maintaining the accuracy of full-budget baselines.
Key takeaway
For machine learning engineers optimizing LLM inference costs, MARS addresses the computational overhead of parallel test-time scaling. By stopping reasoning traces once the leading answer is stable, it uses 25-47% fewer tokens than self-consistency and 14-29% fewer than DeepConf Online while preserving full-budget accuracy. Consider evaluating adaptive stopping rules such as MARS to reduce inference costs in your LLM applications.
Key insights
MARS reduces LLM test-time costs by adaptively stopping parallel reasoning traces while preserving accuracy.
Principles
- Probing partial traces reveals evolving aggregate votes.
- Estimate trace-level switch probabilities for early stopping.
- Use adversarial bounds for future vote movement.
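The first principle can be illustrated with a minimal Python sketch. It assumes each checkpoint yields one probed answer per trace (or `None` when a trace has no answer yet); the function name `vote_margin` and the sample answers are hypothetical and not taken from the paper.

```python
from collections import Counter

def vote_margin(probed_answers):
    """Return (leader, margin) for one checkpoint.

    `None` entries stand for traces that have not produced an answer yet.
    The margin is the leader's vote count minus the runner-up's.
    """
    counts = Counter(a for a in probed_answers if a is not None)
    if not counts:
        return None, 0
    ranked = counts.most_common(2)
    leader, top = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else 0
    return leader, top - second

# Hypothetical probe results for four traces at three successive checkpoints.
checkpoints = [
    [None, "17", "42", None],
    ["42", "17", "42", None],
    ["42", "42", "42", "17"],
]

# Track how the aggregate vote evolves as generation proceeds.
history = [vote_margin(c) for c in checkpoints]
```

Tracking the margin over checkpoints is what gives a stopping rule something to act on: a margin that keeps growing suggests the leading answer is settling.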
Method
MARS probes partial LLM reasoning traces, estimates answer stability using a margin-adversarial stopping rule, and halts generation when the leading vote is secure. A five-feature logistic model predicts whether a trace will switch its answer.
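One plausible reading of this rule is sketched below in Python. It is an illustration under stated assumptions, not the paper's algorithm: the summary does not name the five features, the fitted weights, or the exact bound, so `switch_probability`, `should_stop`, and the `slack` parameter are all hypothetical. The sketch assumes the adversary moves every switching trace to the runner-up answer.

```python
import math
from collections import Counter

def switch_probability(features, weights, bias):
    """Logistic model: probability that a trace changes its current answer.

    The feature values and weights are placeholders for illustration.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def should_stop(answers, switch_probs, slack=0.0):
    """Return (stop, leader) under a worst-case view of future switches.

    Assumption: a switching trace always moves to the runner-up. A leader
    voter that switches costs the margin 2, a voter for any third answer
    costs 1, and runner-up voters cannot hurt the leader further.
    """
    counts = Counter(answers)
    ranked = counts.most_common(2)
    leader, lead_votes = ranked[0]
    runner_up, runner_votes = ranked[1] if len(ranked) > 1 else (None, 0)
    margin = lead_votes - runner_votes

    expected_loss = 0.0
    for answer, p in zip(answers, switch_probs):
        if answer == leader:
            expected_loss += 2.0 * p
        elif answer != runner_up:
            expected_loss += p

    # `slack` stands in for an extra safety term, e.g. one calibrated on warmup traces.
    return margin - expected_loss - slack > 0, leader

# Hypothetical checkpoint: eight traces, all with low switch probability.
stable = should_stop(["42"] * 6 + ["17"] * 2, [0.05] * 8)

# Hypothetical checkpoint: narrow margin and volatile traces.
unstable = should_stop(["42", "42", "42", "17", "17", "9"], [0.4] * 6)
```

In the first case the margin of 4 comfortably exceeds the expected worst-case loss, so the sketch stops; in the second the margin of 1 does not, so generation would continue.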
In practice
- Save 25-47% of tokens relative to self-consistency.
- Use 14-29% fewer tokens than DeepConf Online.
- Match full-budget LLM accuracy.
Topics
- LLM Inference
- Self-Consistency
- Computational Efficiency
- Adaptive Stopping
- Margin-Adversarial Stopping
- Reasoning Models
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.