Power-SMC: Low-Latency Sequence-Level Power Sampling for Training-Free LLM Reasoning

2026-02-12 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, short

Summary

Power-SMC is a training-free Sequential Monte Carlo (SMC) scheme for low-latency sequence-level power sampling in large language model (LLM) reasoning. It targets the sequence-level power distribution $\pi_{\alpha}(y\mid x)\propto p_{\theta}(y\mid x)^{\alpha}$ with $\alpha>1$, which concentrates probability mass on high-likelihood sequences without altering model parameters. Prior Metropolis–Hastings (MH) approaches to power sampling slow inference by 16-28x. Power-SMC reduces the slowdown to 1.4-3.3x over baseline decoding by advancing a small particle set in parallel, correcting importance weights token by token, and resampling within a single GPU-friendly batched decode. The approach adds an exponent-bridging schedule, $\alpha$-ramping, to improve particle stability, and is reported to match or exceed MH power sampling on the MATH500 benchmark.

Key takeaway

For AI engineers and research scientists optimizing LLM inference for reasoning tasks, Power-SMC addresses the main cost of power sampling: latency. If MH-based power sampling is too slow for your deployment, Power-SMC is reported to cut the slowdown from 16-28x to 1.4-3.3x over baseline decoding while matching or exceeding MH performance on MATH500. Because the method is training-free and batch-parallel, it can be evaluated on an existing model without changing its parameters.

Key insights

Power-SMC makes sequence-level power sampling for LLM reasoning practical, cutting the slowdown relative to baseline decoding from 16-28x (MH) to 1.4-3.3x.

Principles

Sharpening the output distribution can improve LLM reasoning without further training.
The sequence-level power distribution biases generation toward high-likelihood sequences.
SMC can approximate target distributions with weighted samples.
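
The difference between sequence-level power sampling and ordinary per-token sharpening can be seen in a two-token toy model. The sketch below is illustrative only: the probabilities, the value of `alpha`, and all variable names are made up for the example and do not come from the paper. It compares $p(y)^{\alpha}$ normalized over whole sequences with the product of per-token conditionals each raised to $\alpha$ and renormalized (temperature $1/\alpha$).

```python
import numpy as np

alpha = 2.0  # illustrative exponent, any alpha > 1 sharpens

# Hypothetical two-step model over a two-token vocabulary {0, 1}.
p_first = np.array([0.6, 0.4])        # p(y1)
p_second = np.array([[0.5, 0.5],      # p(y2 | y1 = 0)
                     [0.9, 0.1]])     # p(y2 | y1 = 1)

# Joint probabilities p(y1, y2), indexed as [y1, y2].
p_seq = p_first[:, None] * p_second

# Sequence-level power distribution: exponentiate whole sequences, then normalize.
power = p_seq ** alpha
power /= power.sum()

# Token-level tempering: sharpen each conditional on its own (temperature 1/alpha).
q_first = p_first ** alpha / np.sum(p_first ** alpha)
q_second = p_second ** alpha / np.sum(p_second ** alpha, axis=1, keepdims=True)
tempered = q_first[:, None] * q_second

print("power distribution:\n", power)
print("token-level tempered distribution:\n", tempered)
```

In this toy case the most probable sequence under the power distribution starts with token 1, while token-level tempering puts most of its mass on sequences starting with token 0. Token-by-token importance weights are one way to correct for that gap.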

Method

Power-SMC uses a particle-based Sequential Monte Carlo scheme, advancing parallel candidate continuations, updating weights token-by-token, and resampling when weights become uneven, all within a batched decode.
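
The loop described above can be sketched on a toy model. This is a minimal sketch under stated assumptions, not the paper's implementation: a random bigram table stands in for the LLM, sequences have fixed length, the resampling trigger is an effective-sample-size (ESS) threshold, resampling is multinomial, and the names `make_bigram_logprobs`, `power_smc`, and `ess_frac` are hypothetical. The proposal is the next-token distribution raised to $\alpha$ and renormalized; with that choice and the intermediate target $p(y_{1:t})^{\alpha}$, the incremental weight is the proposal's normalizing constant $\sum_v p(v\mid y_{<t})^{\alpha}$.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def make_bigram_logprobs(vocab_size, seed=0):
    """Toy stand-in for an LLM: row i holds log p(next token | previous token i)."""
    rng = np.random.default_rng(seed)
    logits = rng.normal(size=(vocab_size, vocab_size))
    return logits - logsumexp(logits, axis=1)[:, None]

def power_smc(logp_table, alpha, n_particles, n_steps, ess_frac=0.5, seed=0):
    """Approximate p(y)^alpha over length-n_steps sequences with weighted particles."""
    rng = np.random.default_rng(seed)
    vocab_size = logp_table.shape[0]
    tokens = np.zeros((n_particles, n_steps + 1), dtype=np.int64)  # column 0: start token 0
    log_w = np.zeros(n_particles)
    for t in range(1, n_steps + 1):
        logp = logp_table[tokens[:, t - 1]]          # batched "forward pass", shape (N, V)
        scaled = alpha * logp
        log_z = logsumexp(scaled, axis=1)            # log normalizer of p^alpha per particle
        q = np.exp(scaled - log_z[:, None])          # proposal: temperature 1/alpha
        # Inverse-CDF sampling of one token per particle.
        u = rng.random(n_particles)
        nxt = (u[:, None] > np.cumsum(q, axis=1)).sum(axis=1)
        tokens[:, t] = np.minimum(nxt, vocab_size - 1)  # guard against rounding in the CDF
        log_w += log_z                               # incremental weight p^alpha / q = Z
        w = np.exp(log_w - logsumexp(log_w, axis=0))
        ess = 1.0 / np.sum(w ** 2)
        if ess < ess_frac * n_particles:             # weights too uneven: resample
            idx = rng.choice(n_particles, size=n_particles, p=w)
            tokens = tokens[idx]
            log_w = np.zeros(n_particles)
    w = np.exp(log_w - logsumexp(log_w, axis=0))
    return tokens[:, 1:], w

table = make_bigram_logprobs(vocab_size=5, seed=1)
samples, weights = power_smc(table, alpha=4.0, n_particles=8, n_steps=6)
```

A real decoder would replace the table lookup with one batched model call per step and would also need end-of-sequence handling, which this sketch omits.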

In practice

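Apply $\alpha$-ramping to keep the particle set stable.
Set the proposal temperature to $\tau=1/\alpha$, the optimal choice among proposals that condition only on the prefix.
Reindex the KV cache in a cache-safe way when particles are resampled.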
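
Two of these points can be illustrated with small helpers. Both are sketches under assumptions the summary does not confirm: the ramp shape (linear from 1 to the target exponent) is a guess, since the schedule is not specified here, and the KV cache is modeled as a list of per-layer NumPy `(keys, values)` arrays whose leading axis indexes particles. The names `alpha_at_step`, `ramp_steps`, and `reindex_kv_cache` are hypothetical.

```python
import numpy as np

def alpha_at_step(t, alpha_target, ramp_steps):
    """Exponent used at token step t: linear ramp from 1 to alpha_target (assumed shape)."""
    if ramp_steps <= 0:
        return float(alpha_target)
    frac = min(t / ramp_steps, 1.0)
    return 1.0 + frac * (alpha_target - 1.0)

def reindex_kv_cache(kv_cache, ancestor_idx):
    """Gather cache rows so that particle i continues from ancestor_idx[i].

    kv_cache: list of (keys, values) per layer, each shaped (n_particles, seq_len, dim).
    Integer-array indexing returns a new array, so particles that share an
    ancestor get separate rows instead of aliasing the same memory.
    """
    idx = np.asarray(ancestor_idx, dtype=np.int64)
    return [(keys[idx], values[idx]) for keys, values in kv_cache]

# Usage on a dummy one-layer cache with 3 particles, 2 cached positions, dim 1.
keys = np.arange(6, dtype=float).reshape(3, 2, 1)
values = -keys
new_cache = reindex_kv_cache([(keys, values)], ancestor_idx=[2, 2, 0])
```

With tensors from a deep learning framework, the same gather would typically be done along the batch dimension for every layer, taking care that the gathered rows do not share storage with the rows they were copied from.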

Topics

LLM Reasoning
Power Sampling
Sequential Monte Carlo
Low-Latency Inference
Distribution Sharpening

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.