MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

MPCoT is a reward-guided multi-path latent reasoning framework that targets the brittleness of Vision-Language-Action (VLA) policies in long-horizon, high-uncertainty control. It initializes M latent hypotheses, refines each for K weight-tied steps, and softly aggregates them before action decoding. A training-only path-preference objective aligns the latent path scorer with downstream execution quality using three signals: expert-action consistency, world-model/VLM-based progress, and success feedback. MPCoT preserves the original 8-step action interface, generates zero reasoning tokens, and exposes configurable inference controls for depth (K) and width (M). It improves long-horizon performance on the LIBERO and CALVIN benchmarks, raising the LIBERO average success rate (SR) from 96.8 to 98.9 and the CALVIN 4-step and 5-step SR by 13.3 and 16.5 points over OpenVLA-OFT, with only a 2.7% parameter increase.

Key takeaway

For machine learning engineers building VLA policies for robotic manipulation, MPCoT is a way to improve decision stability and cross-environment generalization. Its latent reasoning module lets you scale inference compute through configurable depth (K) and width (M), improving long-horizon performance on complex tasks without adding token-generation latency or changing the action interface.

Key insights

MPCoT adds reward-guided multi-path latent reasoning to VLA policies, improving long-horizon control without generating explicit reasoning tokens.

Principles

Latent reasoning deliberates without generating reasoning tokens, so it adds no token latency.
Depth (K) and width (M) controls scale inference compute.
Reward-guided path preference aligns latent path scores with downstream execution quality.

Method

MPCoT initializes M latent hypotheses, refines each for K weight-tied steps, then softly aggregates them using a reward-guided scorer before action decoding.
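The three stages above (initialize, refine, aggregate) can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the summary only specifies M hypotheses, K weight-tied refinement steps, and soft aggregation before action decoding. Everything else here is an assumption made for illustration, including the names `init_params` and `multi_path_latent_reasoning`, the per-path offsets `path_embed`, the residual tanh update, and the linear scorer `w_score`.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def init_params(d, max_paths, seed=0):
    """Random stand-ins for learned weights (hypothetical layout)."""
    rng = np.random.default_rng(seed)
    return {
        "path_embed": 0.1 * rng.standard_normal((max_paths, d)),  # per-path offsets
        "W": 0.1 * rng.standard_normal((d, d)),  # shared by every refinement step
        "b": np.zeros(d),
        "w_score": 0.1 * rng.standard_normal(d),  # linear path scorer
    }


def multi_path_latent_reasoning(h, params, K=2, M=4):
    """Return (aggregated latent of shape (d,), path weights of shape (M,)).

    h is the fused vision-language feature of shape (d,).
    """
    if M > params["path_embed"].shape[0]:
        raise ValueError("M exceeds the number of available path embeddings")

    # 1) Initialize M hypotheses: shared context plus a per-path offset (assumed).
    z = h[None, :] + params["path_embed"][:M]  # (M, d)

    # 2) Refine for K weight-tied steps: the same W and b are reused each step.
    for _ in range(K):
        z = z + np.tanh(z @ params["W"] + params["b"])

    # 3) Score each path and softly aggregate with softmax weights.
    weights = softmax(z @ params["w_score"])  # (M,)
    aggregated = weights @ z  # (d,)
    return aggregated, weights


params = init_params(d=8, max_paths=4)
latent, path_weights = multi_path_latent_reasoning(np.ones(8), params, K=3, M=4)
```

In this sketch the aggregated latent has the same shape as the input feature, so an existing action head could consume it unchanged. Because the refinement weights are tied, changing K at inference time requires no additional parameters.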

In practice

Configure (K, M) to set reasoning depth and width at inference time.
Train the path scorer with the reward-guided path-preference objective, which is used during training only.
Integrate MPCoT as a deliberation layer before the action head.
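The training-time step can be sketched as a pairwise preference loss over paths. The summary names the three reward signals but not the exact objective, so the formulation below is a swapped-in assumption: a Bradley-Terry-style logistic loss, with a hypothetical weighted sum to combine the signals. The function names `path_rewards` and `pairwise_preference_loss` and the default weights are illustrative only.

```python
import numpy as np


def path_rewards(expert_consistency, progress, success, weights=(1.0, 1.0, 1.0)):
    """Combine the three per-path signals into one reward per path.

    The weighted sum and its default weights are assumptions, not from the source.
    """
    a, b, c = weights
    return (
        a * np.asarray(expert_consistency, dtype=float)
        + b * np.asarray(progress, dtype=float)
        + c * np.asarray(success, dtype=float)
    )


def pairwise_preference_loss(scores, rewards):
    """Mean of -log(sigmoid(s_i - s_j)) over all pairs where reward_i > reward_j.

    Pushes the scorer to rank higher-reward paths above lower-reward ones.
    """
    scores = np.asarray(scores, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    score_diff = scores[:, None] - scores[None, :]  # s_i - s_j
    preferred = rewards[:, None] > rewards[None, :]  # path i beats path j
    if not preferred.any():
        return 0.0  # no ordered pair, nothing to learn from
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with logaddexp.
    return float(np.logaddexp(0.0, -score_diff[preferred]).mean())


rewards = path_rewards(
    expert_consistency=[0.2, 0.6, 0.9],
    progress=[0.1, 0.5, 0.8],
    success=[0.0, 0.0, 1.0],
)
loss_aligned = pairwise_preference_loss([0.0, 1.0, 2.0], rewards)
loss_reversed = pairwise_preference_loss([2.0, 1.0, 0.0], rewards)
```

Scores that rank paths in the same order as their rewards give a lower loss than scores that rank them in reverse. Since the loss is applied only during training, it adds nothing to inference cost.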

Topics

Vision-Language-Action
Latent Reasoning
Robot Manipulation
Multi-Path Reasoning
Reward-Guided Learning
Test-Time Scaling

Code references

EDGSCOUT/MPCoT

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.