Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay
Summary
Yi Tay of Google DeepMind discusses his journey from Brain to Reka to leading the Reasoning and AGI team in Singapore, with a focus on Gemini Deep Think and IMO Gold. He details the IMO Gold effort, in which a Gemini model trained in approximately one week solved International Math Olympiad problems end-to-end during the live competition and achieved a gold medal, abandoning prior symbolic systems like AlphaProof. Tay emphasizes the shift to on-policy Reinforcement Learning (RL), where models learn from their own generated outputs and the rewards those outputs earn, akin to humans learning from their mistakes. He also highlights the importance of self-consistency and parallel thinking for advanced reasoning, the ongoing gap in data efficiency between models and human learners, and the emerging utility of AI coding assistants, which he now uses to fix bugs without manual inspection. The discussion also touches on DSI and generative retrieval, now deployed at YouTube and Spotify, and the increasing advantage of closed research labs.
Key takeaway
AI engineers and research scientists focused on advancing model capabilities should prioritize integrating on-policy RL and self-consistency into their training paradigms. The success of IMO Gold with an end-to-end Gemini model demonstrates that betting on a unified, RL-driven approach can yield significant breakthroughs, even in complex reasoning tasks. Consider how your models can learn more efficiently from their own experience and verify their own outputs, rather than relying solely on imitation learning, to push towards more generalized intelligence and practical utility.
Key insights
On-policy RL and self-consistency are key to advancing LLM reasoning and achieving AGI-like capabilities.
Principles
- Models learn best by generating and training on their own rewarded outputs.
- Self-consistency and parallel thinking enhance reasoning beyond single-shot inference.
- Ideas, not just blind scaling, drive AI progress.
Method
The IMO Gold effort involved training an end-to-end Gemini model for approximately one week, using on-policy RL to generate and self-correct solutions for live competition math problems, leveraging self-consistency for verification.
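To make "on-policy" concrete, here is a toy sketch, not the Gemini training recipe: a softmax policy over a handful of candidate answers samples from its own current distribution, receives a reward, and updates itself with a REINFORCE-style gradient. The function names, the running-average baseline, and all hyperparameters are illustrative assumptions.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_on_policy(reward_fn, n_actions, steps=2000, lr=0.1, seed=0):
    """Toy REINFORCE loop: every sample comes from the current policy."""
    rng = random.Random(seed)
    logits = [0.0] * n_actions   # start from a uniform policy
    baseline = 0.0               # running average reward (assumed variance reducer)
    for _ in range(steps):
        probs = softmax(logits)
        # On-policy: the action is drawn from the policy being trained.
        action = rng.choices(range(n_actions), weights=probs)[0]
        reward = reward_fn(action)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log pi(action) w.r.t. each logit is (1[a == action] - probs[a]).
        for a in range(n_actions):
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr * advantage * grad
    return softmax(logits)
```

Calling `train_on_policy(lambda a: 1.0 if a == 2 else 0.0, n_actions=4)` shifts probability mass toward action 2, the only rewarded choice. Imitation learning would instead copy a fixed set of demonstrations and never see the consequences of the model's own samples.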
In practice
- Utilize AI coding assistants for automated bug fixing and code generation.
- Explore on-policy RL for tasks requiring self-generated trajectories and environmental feedback.
- Implement self-consistency techniques like multiple sampling and LM judges for robust reasoning.
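The multiple-sampling form of self-consistency can be sketched as a majority vote over independently sampled answers. This is a minimal illustration under stated assumptions: `sample_fn` stands in for whatever model call you use, and `make_stub_sampler` is a hypothetical helper that replays canned answers so the example runs without a model.

```python
import itertools
from collections import Counter
from typing import Callable, Sequence, Tuple


def self_consistent_answer(
    sample_fn: Callable[[str], str], prompt: str, n_samples: int = 8
) -> Tuple[str, float]:
    """Sample several answers and return the most common one with its vote share."""
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / n_samples


def make_stub_sampler(canned: Sequence[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: cycles through canned answers."""
    it = itertools.cycle(canned)
    return lambda prompt: next(it)
```

The vote share is a rough agreement signal: a low share suggests the samples disagree and the answer may deserve a second check, for example by an LM judge. Exact string matching only works when final answers are normalized to a comparable form.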
Topics
- Gemini Deep Think
- Reinforcement Learning
- International Math Olympiad
- Generative Retrieval
- AI Coding Assistants
Best for: AI Engineer, AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Latent Space: The AI Engineer Podcast.