SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

2026-06-17 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Robotics & Autonomous Systems, Computer Vision & Pattern Recognition, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

SC3-Eval is a self-consistent video generation recipe for evaluating generalist robot manipulation policies, motivated by the high cost and poor scalability of real-world testing. The method adapts a pre-trained video foundation model into a policy evaluator by enforcing three complementary forms of consistency. Forward-inverse dynamics consistency predicts frames from actions and recovers the actions from the predicted frames, anchoring rollouts to a physically plausible action manifold. Cross-view consistency keeps multi-camera observations coherent by inpainting each view from the others. Test-time consistency uses inverse dynamics as an uncertainty signal and terminates a rollout when the generated frames drift from the requested actions. SC3-Eval reproduces real-world policy failure modes for diagnostic comparison, achieving a closed-loop Pearson correlation of 0.929 and an MMRV of 0.119 across seven real-world vision-language-action policies. It outperforms three prior video-model-based baselines and generalizes to new tasks.
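To make the two reported metrics concrete, here is a minimal sketch of how agreement between real and generated-rollout success rates could be scored. It assumes MMRV refers to Mean Maximum Rank Violation as used in earlier simulated-evaluation work, where a lower value means fewer and less severe policy ranking mistakes; the summary does not spell out the definition. The function names and the success rates are illustrative and are not taken from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mmrv(real: Sequence[float], sim: Sequence[float]) -> float:
    """Mean Maximum Rank Violation (assumed definition).

    For each policy i, find the largest real-world performance gap
    |real[i] - real[j]| over all policies j whose relative order is
    flipped by the evaluator, then average over policies.
    """
    n = len(real)
    total = 0.0
    for i in range(n):
        worst = 0.0
        for j in range(n):
            order_flipped = (sim[i] < sim[j]) != (real[i] < real[j])
            if order_flipped:
                worst = max(worst, abs(real[i] - real[j]))
        total += worst
    return total / n


# Illustrative success rates for three hypothetical policies.
real_success = [0.2, 0.5, 0.8]
sim_success = [0.4, 0.3, 0.9]  # the evaluator swaps the first two policies

correlation = pearson(real_success, sim_success)
rank_violation = mmrv(real_success, sim_success)
```

Correlation rewards evaluators whose scores move with the real ones, while MMRV penalizes ranking flips in proportion to how far apart the flipped policies really are, so the two numbers are usually read together.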

Key takeaway

For robotics engineers evaluating generalist manipulation policies, SC3-Eval offers a scalable alternative to expensive real-world rollouts. Its self-consistent video generation can simulate policy behavior, reproduce failure modes for diagnosis, and compare vision-language-action policies against one another. Screening policies this way before real-world deployment can shorten development cycles and reduce testing costs.

Key insights

SC3-Eval uses self-consistent video generation to evaluate robot policies, avoiding the cost of real-world testing and limiting the errors that build up in autoregressive rollouts.

Principles

Forward-inverse dynamics anchors generated rollouts.
Cross-view consistency maintains multi-camera coherence.
Test-time consistency detects and terminates drifting rollouts.

Method

SC3-Eval adapts a pre-trained video foundation model. It trains jointly for forward-inverse dynamics and cross-view inpainting, then uses inverse dynamics at inference for uncertainty-based rollout termination.
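The inference-time check can be sketched as a closed loop: the policy proposes an action, the forward model generates the next frame, the inverse model recovers an action from the frame pair, and the rollout stops once the recovered action disagrees with the requested one by more than a threshold. Everything below is a toy under stated assumptions: frames are short float vectors rather than images, and `policy`, `forward_model`, `inverse_model`, and `threshold` are hypothetical stand-ins, not the paper's models or settings.

```python
from typing import Callable, List, Optional, Tuple

Frame = List[float]
Action = List[float]


def l2_distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def consistent_rollout(
    policy: Callable[[Frame], Action],
    forward_model: Callable[[Frame, Action], Frame],
    inverse_model: Callable[[Frame, Frame], Action],
    first_frame: Frame,
    max_steps: int,
    threshold: float,
) -> Tuple[List[Frame], Optional[int]]:
    """Roll out a policy inside a generative model with a consistency check.

    Returns the accepted frames and the step at which the rollout was
    terminated, or None if it ran for all max_steps.
    """
    frames = [first_frame]
    for step in range(max_steps):
        action = policy(frames[-1])
        next_frame = forward_model(frames[-1], action)
        recovered = inverse_model(frames[-1], next_frame)
        # Disagreement between requested and recovered action is the
        # uncertainty signal; a large value means the frame has drifted.
        if l2_distance(recovered, action) > threshold:
            return frames, step
        frames.append(next_frame)
    return frames, None


# Toy stand-ins: a constant policy, a forward model whose error grows
# with the state, and an exact frame-difference inverse model.
def toy_policy(frame: Frame) -> Action:
    return [0.1]


def drifting_forward(frame: Frame, action: Action) -> Frame:
    return [f + a + 0.05 * f for f, a in zip(frame, action)]


def toy_inverse(prev: Frame, nxt: Frame) -> Action:
    return [n - p for p, n in zip(prev, nxt)]


frames, stopped_at = consistent_rollout(
    toy_policy, drifting_forward, toy_inverse,
    first_frame=[0.0], max_steps=20, threshold=0.02,
)
```

In this toy the drift grows with each step, so the rollout is cut short instead of continuing on frames that no longer reflect the requested actions. How the threshold is chosen, and how a terminated rollout is scored, is not described in the summary.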

In practice

Diagnose robot policy failure modes.
Scale robot policy evaluation.
Compare vision-language-action policies.

Topics

Robot Foundation Models
Policy Evaluation
Video Generation
Robot Manipulation
Computer Vision
Self-Consistent Learning

Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.