Teaching AI agents to ask better questions by playing “Battleship”
Summary
MIT researchers from CSAIL and SEAS developed "Collaborative Battleship" to test AI agents' question-asking abilities, finding that smaller models can outperform larger ones at 1 percent of the cost. They created the "BattleshipQA" dataset from human play. Initially, top LMs like GPT-5 beat humans, but smaller systems like Llama 4 Scout struggled. By implementing a Monte Carlo inference strategy, Llama 4 Scout's win rate against humans jumped from 8 percent to 82 percent, surpassing GPT-5's performance. Additionally, converting questions into code boosted answer accuracy by 15 percent on average, with GPT-4o-mini seeing a nearly 30 percent bump. This approach also improved performance in "Guess Who?", with Llama 4 Scout reaching 72 percent success and GPT-4o 90 percent.
Key takeaway
Machine learning engineers developing AI agents for information-seeking tasks should consider integrating explicit inference strategies and code-based verification. With these additions, agents built on smaller models can achieve better performance at lower cost, as demonstrated by Llama 4 Scout outperforming GPT-5 at 1 percent of the cost. Focus on equipping models with "world models" and question-to-code conversion to improve how they explore and gather information in uncertain environments.
Key insights
AI agents ask better questions and make discoveries more efficiently when given access to a "world model" and explicit verification methods.
Principles
- Monte Carlo inference improves question utility.
- Converting questions to code boosts answer accuracy.
- Smaller models can exceed larger models' performance.
Method
Implement a Monte Carlo inference strategy so the agent can weigh its candidate questions, and separately convert natural language questions into executable code so that answers can be verified.
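To make the Monte Carlo idea concrete, here is a minimal sketch of one common way to weigh candidate questions: sample many hidden boards consistent with what has been observed, then score each yes/no question by how evenly it splits those samples. This is an illustration under stated assumptions, not the researchers' implementation. The single-ship board, the rejection sampler, and the names `sample_board`, `expected_info_gain`, and `rank_questions` are all hypothetical.

```python
import math
import random
from typing import Callable, Dict, List, Set, Tuple

Cell = Tuple[int, int]
Board = Set[Cell]  # cells occupied by the (single, assumed) ship


def sample_board(size: int, ship_len: int, misses: Set[Cell],
                 hits: Set[Cell], rng: random.Random,
                 max_tries: int = 10_000) -> Board:
    """Rejection-sample one ship placement consistent with known hits and misses."""
    for _ in range(max_tries):
        if rng.random() < 0.5:  # horizontal placement
            r, c = rng.randrange(size), rng.randrange(size - ship_len + 1)
            board = {(r, c + i) for i in range(ship_len)}
        else:  # vertical placement
            r, c = rng.randrange(size - ship_len + 1), rng.randrange(size)
            board = {(r + i, c) for i in range(ship_len)}
        # Keep the sample only if it avoids every miss and covers every hit.
        if not (board & misses) and hits <= board:
            return board
    raise RuntimeError("no board consistent with the observations was found")


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no answer that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_info_gain(question: Callable[[Board], bool],
                       boards: List[Board]) -> float:
    """Monte Carlo estimate: for a noise-free yes/no question, the expected
    information gain equals the entropy of the answer over sampled boards."""
    p_yes = sum(1 for b in boards if question(b)) / len(boards)
    return binary_entropy(p_yes)


def rank_questions(questions: Dict[str, Callable[[Board], bool]],
                   boards: List[Board]) -> List[Tuple[str, float]]:
    """Return (question, estimated gain) pairs, most informative first."""
    scored = [(name, expected_info_gain(q, boards)) for name, q in questions.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In this sketch, a question that every sampled board answers the same way scores zero bits, while one that splits the samples evenly scores close to one bit, so the agent would prefer the latter.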
In practice
- Apply Monte Carlo inference to guide agent exploration.
- Use code generation for answer validation.
- Evaluate smaller LMs with enhanced strategies.
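The code-based validation item above can be sketched as follows: instead of asking a model to answer a question about the board directly, the question is translated into a small function that is executed against the board. This is a minimal sketch under assumptions, not the method from the original article. The board encoding, the `answer` function convention, and the hand-written `GENERATED_SOURCE` string (standing in for model output) are all hypothetical.

```python
from typing import Callable, List

Board = List[List[str]]  # assumed encoding: "W" is water, any other letter is a ship

# Hand-written stand-in for what a language model might return for the
# question "Is any part of a ship in row 2?". Hypothetical output.
GENERATED_SOURCE = '''
def answer(board):
    return any(cell != "W" for cell in board[2])
'''

# Only a few harmless builtins are exposed to the generated code.
# Note: this is NOT a real sandbox; untrusted code needs proper isolation.
ALLOWED_BUILTINS = {"any": any, "all": all, "len": len, "range": range, "sum": sum}


def compile_question(source: str) -> Callable[[Board], bool]:
    """Execute generated source and return the `answer` function it defines."""
    env = {"__builtins__": ALLOWED_BUILTINS}
    exec(source, env)
    fn = env.get("answer")
    if not callable(fn):
        raise ValueError("generated code must define a function named 'answer'")
    return fn


def answer_question(source: str, board: Board) -> bool:
    """Run the generated function on the board and insist on a yes/no result."""
    result = compile_question(source)(board)
    if not isinstance(result, bool):
        raise TypeError("generated code must return True or False")
    return result


TRUE_BOARD: Board = [
    list("WWWW"),
    list("WWWW"),
    list("WAAW"),  # ship "A" occupies two cells of row 2
    list("WWWW"),
]

if __name__ == "__main__":
    print(answer_question(GENERATED_SOURCE, TRUE_BOARD))
```

The design choice here is that the answer comes from deterministic execution rather than from the model's reading of the board, so any error is confined to the translation step, which can be inspected.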
Topics
- AI Agents
- Language Models
- Monte Carlo Inference
- Question Answering
- Information Seeking
- Model Efficiency
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by MIT News - Artificial intelligence.