Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

2026-05-03 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

A new study audits the multimodal physics evaluation pipeline and identifies three issues: train-eval contamination, translation drift, and multiple-choice question (MCQ) saturation. Traditional single-stage 5-gram-Jaccard audits failed to detect contamination, whereas a three-stage audit (Jaccard, mxbai-embed-large cosine, Haiku-4.5 LLM-judge) found 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. The study also reports that Sonnet-4.5 scored about 17 percentage points (pp) lower on English translations of Estonian physics problems than on the Estonian originals (13.6% vs. 30.5%). In addition, the same Sonnet-4.5 weights showed a 46-pp gap between MCQ evaluation (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). To address these issues, the researchers released four artifacts: PhysCorp-A (a 6,432-record audited multimodal corpus), PhysR1Corp (a 2,268-record closed-form RL pool), PhysOlym-A (a 500-problem novel-source held-out olympiad evaluation), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking, which improved performance by 18.3 pp on PhysOlym-A liberal.
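As a quick check on the rounded gaps quoted in the summary, the percentage-point differences can be recomputed from the reported scores. This is only a minimal arithmetic sketch; the variable and function names are illustrative and do not come from the paper or its repository.

```python
# Scores as quoted in the summary (percent accuracy).
translation_pair = (30.5, 13.6)   # the two Sonnet-4.5 scores in the translation comparison
format_pair = (79.7, 33.4)        # MCQ on PhyX vs. open-ended on PhysOlym-A


def pp_gap(higher: float, lower: float) -> float:
    """Percentage-point gap between two accuracies, rounded to one decimal."""
    return round(higher - lower, 1)


translation_gap = pp_gap(*translation_pair)  # about 16.9, reported as 17 pp
format_gap = pp_gap(*format_pair)            # about 46.3, reported as 46 pp
```

Both reported figures are therefore roundings to the nearest whole percentage point.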

Key takeaway

AI scientists and machine learning engineers who develop or evaluate vision-language models for physics reasoning should adopt a rigorous three-stage audit protocol for their training and evaluation datasets to prevent contamination. Model performance can degrade significantly on translated problems, and multiple-choice formats can inflate scores. Prioritize open-ended, novel-source evaluations such as PhysOlym-A to gauge model capabilities accurately and avoid misleading benchmark results.

Key insights

Multimodal physics reasoning benchmarks are undermined by train-eval contamination, translation effects, and evaluation-format biases.

Principles

Three-stage auditing is essential for robust benchmark cleanliness.
Translation can significantly alter model performance on identical problems.
Evaluation format and problem novelty impact model scores by large margins.

Method

A three-stage audit pipeline (n-gram Jaccard, embedding cosine, LLM-judge) identifies near-duplicates and paraphrases. A binary correctness reward is recommended for RL training over dense, multi-component rewards.
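A binary correctness reward of the kind mentioned above might look like the following minimal sketch. It assumes that each completion ends with a line of the form "Answer: <number>" and that the gold answer is a single real number; the answer format, the 1% relative tolerance, and the function names are all assumptions for illustration, not the paper's verifier.

```python
import math
import re
from typing import Optional

# Hypothetical answer format: a numeric value right after an "Answer:" tag.
_ANSWER = re.compile(r"Answer:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def extract_answer(completion: str) -> Optional[float]:
    """Return the number following the last 'Answer:' tag, or None if absent."""
    matches = _ANSWER.findall(completion)
    return float(matches[-1]) if matches else None


def binary_reward(completion: str, gold: float, rel_tol: float = 1e-2) -> float:
    """1.0 if the extracted answer matches gold within rel_tol, else 0.0.

    No partial credit is given for formatting, reasoning length, or
    intermediate steps, which is what separates this from a dense,
    multi-component reward.
    """
    value = extract_answer(completion)
    if value is None:
        return 0.0
    return 1.0 if math.isclose(value, gold, rel_tol=rel_tol, abs_tol=1e-9) else 0.0
```

For example, binary_reward("... Answer: 4.9 m/s^2", 4.9) yields 1.0, while a completion with no "Answer:" tag yields 0.0.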

In practice

Implement a three-stage audit for new VLM benchmarks.
Prioritize original-language gold standards for multilingual evaluations.
Use open-ended, novel-source evaluations to assess true VLM capability.
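The three-stage audit recommended above can be sketched as a cascade: a cheap word 5-gram Jaccard check first, then an embedding cosine check, then an LLM judge on whatever the embedding stage flags. In this sketch the embedding model and the judge are passed in as plain callables (the digest names mxbai-embed-large and Haiku-4.5, but no real client API is called here), and the thresholds, labels, and function names are assumptions rather than the paper's settings.

```python
import math
from typing import Callable, Sequence, Set, Tuple

Embedder = Callable[[str], Sequence[float]]   # hypothetical: text -> vector
Judge = Callable[[str, str], bool]            # hypothetical: True if same problem


def ngrams(text: str, n: int = 5) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams; short texts yield one shorter gram."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: str, b: str, n: int = 5) -> float:
    """Jaccard similarity between the n-gram sets of two texts."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def audit_pair(train_item: str, eval_item: str, embed: Embedder, judge: Judge,
               jaccard_thr: float = 0.5, cosine_thr: float = 0.85) -> str:
    """Classify one train/eval pair; thresholds are illustrative assumptions."""
    # Stage 1: lexical overlap catches verbatim or lightly edited copies.
    if jaccard(train_item, eval_item) >= jaccard_thr:
        return "near_duplicate"
    # Stage 2: embedding similarity flags paraphrase candidates.
    if cosine(embed(train_item), embed(eval_item)) >= cosine_thr:
        # Stage 3: the LLM judge confirms or rejects each candidate.
        if judge(train_item, eval_item):
            return "confirmed_paraphrase"
        return "rejected_candidate"
    return "clean"
```

Ordering the stages from cheapest to most expensive means the LLM judge only sees pairs that the first two stages could not settle, which matters when the candidate set runs into the thousands.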

Topics

Multimodal Physics Reasoning
Evaluation Pipeline Audit
Train-Eval Contamination
Translation Drift
MCQ Saturation

Code references

shanyang-me/physics-r1-neurips2026

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.