The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, extended

Summary

Foundation model agents are increasingly deployed for real-world decision-making, yet their performance often degrades when deployment conditions differ from those under which they were evaluated and trained. This 2026 paper proposes formalizing that evaluation and training gap as a classical sim-to-real problem, structured around the four elements of a Markov Decision Process (Observation, Action, Transition, Reward). It argues against treating agent robustness as a novel phenomenon and advocates adopting established solutions, such as domain randomization, from robotics and classical control. The research agenda translates classical discrepancies into the foundation model domain and provides concrete examples such as multilingual tool calling. For instance, when instructions were switched from English to Chinese, error rates rose from 13.5% to 28.5% for GPT5 and from 5.5% to 46.5% for Qwen-Next-80B, an increase attributed to a language mismatch in parameter values. The ultimate goal is a unified vocabulary and standardized stress-test benchmarks for highly trustworthy agents.
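To put the multilingual tool-calling figures in perspective, the short sketch below computes the absolute and relative increase in error rate for each model. It uses only the four percentages quoted in the summary; the helper name `gap` is an illustrative choice, not something from the paper.

```python
from typing import Dict, Tuple

# Error rates in percent, as quoted in the summary: (English, Chinese).
reported: Dict[str, Tuple[float, float]] = {
    "GPT5": (13.5, 28.5),
    "Qwen-Next-80B": (5.5, 46.5),
}


def gap(before: float, after: float) -> Tuple[float, float]:
    """Return (increase in percentage points, ratio after / before)."""
    return after - before, after / before


for model, (english, chinese) in reported.items():
    delta, ratio = gap(english, chinese)
    print(f"{model}: +{delta:.1f} points, {ratio:.2f}x")
```

On these numbers the GPT5 error rate roughly doubles (+15.0 points, about 2.11x), while the Qwen-Next-80B error rate grows by +41.0 points, about 8.45x.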

Key takeaway

If you are a machine learning engineer deploying foundation model agents, address the sim-to-real gap proactively by adopting established MDP-based frameworks. Your evaluation should systematically stress-test agents across observation, action, transition, and reward discrepancies. This approach, combined with techniques like domain randomization, helps catch critical failures before deployment and makes your agents more robust and trustworthy in production.

Key insights

The sim-to-real gap in foundation model agents can be effectively addressed by applying established Markov Decision Process frameworks from classical control and robotics.

Principles

Formalize FM agent robustness via MDP elements.
Adopt classical sim-to-real solutions like domain randomization.
Standardized stress tests are crucial for trustworthy agents.

Method

The proposed method decomposes the evaluation and training gaps of FM agents into the four MDP elements (Observation, Action, Transition, Reward) and applies classical mitigation techniques, such as domain randomization and grounded action transformation, to each element.
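One way to make this decomposition concrete is a small catalog that tags each known discrepancy with the MDP element it affects and then reports which elements have no coverage yet. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, and the pairing of each discrepancy with a mitigation is illustrative rather than the paper's own mapping.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class MDPElement(Enum):
    OBSERVATION = "observation"
    ACTION = "action"
    TRANSITION = "transition"
    REWARD = "reward"


@dataclass(frozen=True)
class Discrepancy:
    element: MDPElement
    description: str  # how deployment differs from the evaluation setting
    mitigation: str   # candidate classical technique (illustrative pairing)


def group_by_element(items: List[Discrepancy]) -> Dict[MDPElement, List[Discrepancy]]:
    """Bucket discrepancies by the MDP element they affect."""
    grouped: Dict[MDPElement, List[Discrepancy]] = defaultdict(list)
    for item in items:
        grouped[item.element].append(item)
    return dict(grouped)


def uncovered(items: List[Discrepancy]) -> List[MDPElement]:
    """List MDP elements that no cataloged discrepancy addresses yet."""
    covered = {item.element for item in items}
    return [element for element in MDPElement if element not in covered]


# Hypothetical catalog; the mitigation assignments are assumptions.
catalog = [
    Discrepancy(MDPElement.OBSERVATION, "noisy text input", "domain randomization"),
    Discrepancy(MDPElement.ACTION, "distractor tools", "grounded action transformation"),
    Discrepancy(MDPElement.TRANSITION, "timeouts, partial responses", "domain randomization"),
]

print([element.value for element in uncovered(catalog)])
```

Here the audit flags the Reward element as uncovered, which is the kind of blind spot a per-element checklist is meant to surface.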

In practice

Inject noise into text observations for robustness.
Expand action space with distractors to test disambiguation.
Vary transition fidelity with timeouts and partial responses.
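The three practices above can be sketched as small perturbation helpers applied to an evaluation episode. Everything here is an assumption made for illustration: the function names, the adjacent-character swap as the noise model, and the representation of a timeout as `None` are not taken from the paper.

```python
import random
from typing import List, Optional


def inject_noise(text: str, rate: float, rng: random.Random) -> str:
    """Swap adjacent characters with probability `rate` to mimic typos."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so a character moves at most once
        else:
            i += 1
    return "".join(chars)


def add_distractors(tools: List[str], distractors: List[str], k: int,
                    rng: random.Random) -> List[str]:
    """Return the real tools plus `k` sampled distractors, in shuffled order."""
    pool = list(tools) + rng.sample(list(distractors), k)
    rng.shuffle(pool)
    return pool


def degrade_response(response: str, rng: random.Random,
                     p_timeout: float, p_partial: float) -> Optional[str]:
    """Simulate a tool call that times out (None), truncates, or succeeds."""
    roll = rng.random()
    if roll < p_timeout:
        return None
    if roll < p_timeout + p_partial:
        return response[: len(response) // 2]
    return response


rng = random.Random(0)  # fixed seed so a stress-test run is reproducible
print(inject_noise("book a flight to Paris", 0.2, rng))
print(add_distractors(["search_flights"], ["search_trains", "search_hotels"], 1, rng))
print(degrade_response('{"status": "ok", "price": 120}', rng, 0.2, 0.3))
```

Sweeping `rate`, `k`, `p_timeout`, and `p_partial` over a range of values, rather than fixing them, is the domain-randomization flavor of this setup: the agent is scored across a distribution of perturbed conditions instead of a single clean one.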

Topics

Foundation Model Agents
Sim-to-Real Gap
Markov Decision Process
Domain Randomization
Multilingual Tool Calling
Agent Robustness

Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.