Teaching AI agents to ask better questions by playing “Battleship”

2026-06-03 · Source: MIT News - Artificial intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Intermediate, medium

Summary

Researchers from MIT's CSAIL and Harvard's SEAS developed "Collaborative Battleship" to test how well AI agents ask questions, and found that smaller models equipped with explicit inference strategies can outperform larger ones at 1 percent of the cost. They also built the "BattleshipQA" dataset from human play. Without extra help, top LMs such as GPT-5 beat human players, while smaller systems such as Llama 4 Scout struggled. With a Monte Carlo inference strategy, Llama 4 Scout's win rate against humans rose from 8 percent to 82 percent, surpassing GPT-5's performance. Converting questions into code improved answer accuracy by 15 percent on average, with GPT-4o-mini gaining nearly 30 percent. The same approach carried over to "Guess Who?", where Llama 4 Scout reached 72 percent success and GPT-4o reached 90 percent.

Key takeaway

Machine learning engineers building agents for information-seeking tasks should consider adding explicit inference strategies and code-based verification. In this study, those additions let Llama 4 Scout outperform GPT-5 at 1 percent of the cost. Equipping models with a "world model" and question-to-code conversion improves how they explore and gather information in uncertain environments.

Key insights

AI agents ask better questions and make discoveries more efficiently when given access to a "world model" and explicit verification methods.

Principles

Monte Carlo inference improves question utility.
Converting questions to code boosts answer accuracy.
Smaller models given these strategies can exceed larger models' performance.

Method

Use a Monte Carlo inference strategy to weigh the agent's options, and convert natural-language questions into executable code so that answers can be verified.
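
To make the Monte Carlo idea concrete, here is a minimal sketch of one common way to score yes/no questions: sample many boards consistent with what has been observed, then prefer the question whose answer is least predictable across those samples. The board size, fleet, sampler, and question set are all hypothetical choices for illustration, not the researchers' actual setup.

```python
import math
import random

SIZE = 4               # hypothetical small board, not the benchmark's real size
SHIP_LENGTHS = [3, 2]  # hypothetical fleet

def random_board(rng):
    """Place ships one at a time without overlap; return the set of occupied (row, col) cells."""
    occupied = set()
    for length in SHIP_LENGTHS:
        while True:
            horizontal = rng.random() < 0.5
            r = rng.randrange(SIZE if horizontal else SIZE - length + 1)
            c = rng.randrange(SIZE - length + 1 if horizontal else SIZE)
            cells = {(r, c + i) if horizontal else (r + i, c) for i in range(length)}
            if not cells & occupied:
                occupied |= cells
                break
    return occupied

def sample_consistent(hits, misses, n, rng, max_tries=100_000):
    """Rejection sampling: keep only boards that agree with known hits and misses."""
    samples = []
    for _ in range(max_tries):
        board = random_board(rng)
        if hits <= board and not (misses & board):
            samples.append(board)
            if len(samples) == n:
                break
    return samples

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_info_gain(question, samples):
    """For a noiseless yes/no answer, the expected gain is the entropy of the answer."""
    p_yes = sum(question(b) for b in samples) / len(samples)
    return binary_entropy(p_yes)

def best_question(questions, samples):
    return max(questions, key=lambda text: expected_info_gain(questions[text], samples))

rng = random.Random(0)
hits, misses = {(0, 0)}, {(3, 3)}
samples = sample_consistent(hits, misses, 500, rng)
questions = {
    # Already answered by the hit at (0, 0), so it carries no information.
    "Is any ship in row 0?": lambda b: any(r == 0 for r, _ in b),
    "Is any ship in the right half?": lambda b: any(c >= 2 for _, c in b),
}
print(best_question(questions, samples))
```

In this sketch a question that every sampled board answers the same way scores zero, which is why the agent skips it in favor of one that splits the remaining hypotheses.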

In practice

Apply Monte Carlo inference to agent exploration.
Use code generation for answer validation.
Evaluate smaller LMs with enhanced strategies.
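
As an illustration of code-based answer validation, the sketch below runs a small generated function against the board instead of asking a model to answer in free text. The board encoding, the `answer(board)` convention, and the hard-coded source strings (standing in for LM output) are assumptions made for this example.

```python
# Hypothetical board: "." is water, letters mark ships (R = red, B = blue).
BOARD = [
    "RR..",
    "...B",
    "...B",
    "...B",
]

# Small allow-list of builtins. This limits accidents but is NOT a security sandbox.
SAFE_BUILTINS = {"any": any, "all": all, "sum": sum, "len": len,
                 "range": range, "enumerate": enumerate}

def answer_with_code(generated_source, board):
    """Execute source that defines `answer(board)` and return its result as a bool."""
    env = {"__builtins__": SAFE_BUILTINS}
    exec(generated_source, env)
    return bool(env["answer"](board))

# Stand-in for what a model might emit for "Is the red ship horizontal?"
RED_HORIZONTAL = '''
def answer(board):
    rows = {r for r, line in enumerate(board) for ch in line if ch == "R"}
    return len(rows) == 1
'''

# Stand-in for "Is the blue ship horizontal?"
BLUE_HORIZONTAL = '''
def answer(board):
    rows = {r for r, line in enumerate(board) for ch in line if ch == "B"}
    return len(rows) == 1
'''

print(answer_with_code(RED_HORIZONTAL, BOARD))   # True
print(answer_with_code(BLUE_HORIZONTAL, BOARD))  # False
```

Because the answer comes from executing code on the actual board state, it is deterministic and can be checked, which is the property the summary credits for the accuracy gains.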

Topics

AI Agents
Language Models
Monte Carlo Inference
Question Answering
Information Seeking
Model Efficiency

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by MIT News - Artificial intelligence.