The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective
Summary
Foundation model agents are increasingly deployed for real-world decision-making, but they suffer from a sim-to-real gap. This 2026 paper proposes formalizing this evaluation and training gap as a classical sim-to-real problem, structured around the four elements of a Markov Decision Process (Observation, Action, Transition, Reward). It argues against treating agent robustness as a novel phenomenon and advocates adopting established solutions such as domain randomization from robotics and classical control. The research agenda translates classical discrepancies into the foundation model domain and provides concrete examples such as multilingual tool calling. For instance, when instructions were switched from English to Chinese, error rates rose from 13.5% to 28.5% for GPT5 and from 5.5% to 46.5% for Qwen-Next-80B, a failure attributed to a language mismatch in parameter values. The ultimate goal is a unified vocabulary and standardized stress-test benchmarks for highly trustworthy agents.
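To put the reported numbers in perspective, the short sketch below computes the absolute and relative increase in error rate for each model. It only uses the four percentages quoted in the summary; the helper name `gap` is a hypothetical choice for illustration and does not come from the paper.

```python
# Error rates in percent as quoted in the summary: (English, Chinese).
reported = {
    "GPT5": (13.5, 28.5),
    "Qwen-Next-80B": (5.5, 46.5),
}


def gap(english: float, chinese: float) -> tuple:
    """Return (absolute increase in percentage points, relative multiplier)."""
    return chinese - english, chinese / english


for model, (en, zh) in reported.items():
    delta, ratio = gap(en, zh)
    print(f"{model}: +{delta:.1f} points, {ratio:.2f}x the English error rate")
# GPT5: +15.0 points, 2.11x the English error rate
# Qwen-Next-80B: +41.0 points, 8.45x the English error rate
```

Read this way, the same language switch roughly doubles the error rate for one model and multiplies it more than eightfold for the other.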
Key takeaway
If you are a machine learning engineer deploying foundation model agents, address the sim-to-real gap proactively by adopting established MDP-based frameworks. Stress-test your agents systematically across observation, action, transition, and reward discrepancies. This approach, together with techniques like domain randomization, is meant to catch critical failures before deployment and make agents more robust and trustworthy in production.
Key insights
The sim-to-real gap in foundation model agents can be effectively addressed by applying established Markov Decision Process frameworks from classical control and robotics.
Principles
- Formalize FM agent robustness via MDP elements.
- Adopt classical sim-to-real solutions like domain randomization.
- Treat standardized stress tests as crucial for trustworthy agents.
Method
The proposed method decomposes FM agent evaluation and training gaps into the four MDP elements (Observation, Action, Transition, Reward) and applies classical mitigation techniques, such as domain randomization and grounded action transformation, to each element.
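One way to picture this decomposition is a per-episode configuration with one knob per MDP element, sampled at random in the spirit of domain randomization. The sketch below is illustrative only: the class name, the field names, the value ranges, and the reward-noise knob in particular are assumptions, not details taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeConfig:
    """One sampled evaluation condition, with one knob per MDP element."""
    observation_noise: float        # Observation: fraction of characters corrupted
    num_distractor_tools: int       # Action: extra tools offered to the agent
    timeout_probability: float      # Transition: chance that a tool call times out
    reward_flip_probability: float  # Reward: chance that feedback is wrong (assumed knob)


def sample_config(rng: random.Random) -> EpisodeConfig:
    """Domain randomization: draw every knob from a hypothetical range."""
    return EpisodeConfig(
        observation_noise=rng.uniform(0.0, 0.1),
        num_distractor_tools=rng.randint(0, 5),
        timeout_probability=rng.uniform(0.0, 0.2),
        reward_flip_probability=rng.uniform(0.0, 0.05),
    )


rng = random.Random(0)  # fixed seed so the sampled conditions are reproducible
configs = [sample_config(rng) for _ in range(3)]
for config in configs:
    print(config)
```

Evaluating or training an agent over many such sampled conditions, rather than a single clean one, is the core idea borrowed from robotics. Grounded action transformation is not shown here.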
In practice
- Inject noise into text observations for robustness.
- Expand action space with distractors to test disambiguation.
- Vary transition fidelity with timeouts and partial responses.
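The three practices above could be sketched as small perturbation functions, one each for observations, actions, and transitions. This is a minimal illustration under stated assumptions: the function names, the character-substitution noise model, the tool names, and the probabilities are all hypothetical and not taken from the paper.

```python
import random
from typing import List, Optional


def inject_noise(text: str, rate: float, rng: random.Random) -> str:
    """Observation: replace each letter with a random one with probability `rate`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(
        rng.choice(letters) if ch.isalpha() and rng.random() < rate else ch
        for ch in text
    )


def add_distractors(tools: List[str], distractors: List[str], k: int,
                    rng: random.Random) -> List[str]:
    """Action: mix up to k distractor tool names into the real tool list."""
    chosen = rng.sample(distractors, min(k, len(distractors)))
    mixed = tools + chosen
    rng.shuffle(mixed)  # avoid the real tools always appearing first
    return mixed


def degrade_response(response: str, timeout_p: float, truncate_p: float,
                     rng: random.Random) -> Optional[str]:
    """Transition: return None for a timeout, or half the response if truncated."""
    roll = rng.random()
    if roll < timeout_p:
        return None
    if roll < timeout_p + truncate_p:
        return response[: len(response) // 2]
    return response


rng = random.Random(0)
noisy = inject_noise("Book a flight to Paris", 0.2, rng)
tools = add_distractors(["search_flights"], ["search_trains", "search_hotels"], 2, rng)
result = degrade_response('{"status": "ok"}', 0.1, 0.2, rng)
print(noisy, tools, result)
```

Each function takes an explicit `random.Random` instance so that a failing stress test can be replayed with the same seed.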
Topics
- Foundation Model Agents
- Sim-to-Real Gap
- Markov Decision Process
- Domain Randomization
- Multilingual Tool Calling
- Agent Robustness
Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.