Learning User Simulators with Turing Rewards

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Social Simulation & Behavioral Modeling · Depth: Expert, extended

Summary

Turing-RL is a reinforcement learning approach that trains user simulator models to produce responses indistinguishable from those of real human users. Developed by researchers from MIT, Stanford, and the MIT-IBM Watson AI Lab, the method uses a discriminative Turing reward: an LLM judge (Qwen3.5-397B-A17B) scores how human-like a generated response is, given the user's history. Unlike existing methods, which match a single ground-truth response, Turing-RL optimizes for indistinguishability. Evaluated on multi-turn chat (PRISM dataset) and Reddit forum discussions (ConvoKit), Turing-RL consistently outperformed baselines such as Sim-RL and Logprob-RL on both LLM-based (Claude Sonnet 4.6) and human evaluation metrics. For instance, on Chat, Turing-RL achieved a human win rate of 0.57, higher than both SFT-Init and Sim-RL, while maintaining comparable content similarity to the ground truth.

Key takeaway

Machine learning engineers developing interactive AI systems should consider indistinguishability-based training for user simulators. This approach, exemplified by Turing-RL, improves human-likeness over traditional content-matching methods without sacrificing content alignment. To apply it, train the simulator with a discriminative reward from an LLM judge, which is reported to identify human-like responses more effectively and reliably than human evaluators.

Key insights

Training user simulators for indistinguishability from real users, rather than content matching, yields superior human-like responses.

Principles

Method

Turing-RL trains LLM user simulators using a discriminative Turing reward. The simulator policy is first initialized via SFT with chain-of-thought. An LLM judge then scores each generated response for how indistinguishable it is from the ground-truth response, and GRPO (Group Relative Policy Optimization) optimizes the policy on those scores.
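
The reward-to-advantage step described above can be sketched roughly as follows. This is an illustrative outline, not the authors' implementation: the `Judge` signature, the `toy_judge` heuristic, and the function names are all hypothetical, and the group normalization shown (reward minus group mean, divided by group standard deviation) is one common formulation of GRPO advantages that the paper may refine or alter. In a real setup the judge would be an LLM call, and the advantages would weight a policy-gradient update.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence

# Hypothetical judge interface: (user_history, candidate_response) -> score.
# In Turing-RL this role is played by an LLM judge; here it is just a callable.
Judge = Callable[[str, str], float]


def turing_rewards(history: str, candidates: Sequence[str], judge: Judge) -> List[float]:
    """Score every sampled candidate for the same user history."""
    return [judge(history, candidate) for candidate in candidates]


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group (GRPO-style).

    Assumption: population standard deviation with a small eps for stability;
    actual implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_judge(history: str, response: str) -> float:
    """Arbitrary stand-in for an LLM judge: favors short replies.

    This heuristic is for demonstration only and says nothing about what
    actually makes a response human-like.
    """
    return 1.0 if len(response.split()) <= 8 else 0.2


if __name__ == "__main__":
    history = "user: any tips for a cheap weekend trip?"
    group = [
        "honestly just take the train somewhere close",
        "Certainly! Here are several carefully considered options for an affordable weekend getaway.",
    ]
    rewards = turing_rewards(history, group, toy_judge)
    advantages = group_relative_advantages(rewards)
    for text, reward, adv in zip(group, rewards, advantages):
        print(f"reward={reward:.2f} advantage={adv:+.2f} | {text}")
```

Because advantages are computed relative to the other samples in the same group, a response is reinforced only when the judge rates it above its siblings. If every sample in a group receives the same score, all advantages are zero and that group contributes no learning signal.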

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.