Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay

2026-02-06 · Source: Latent Space: The AI Engineer Podcast · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

Yi Tay of Google DeepMind discusses his path from Google Brain to Reka to leading the Reasoning and AGI team in Singapore, with a focus on Gemini Deep Think and IMO Gold. He details the IMO Gold effort: in a live competition, a Gemini model trained in approximately one week earned a gold medal by solving International Mathematical Olympiad (IMO) problems end-to-end, rather than relying on prior symbolic systems such as AlphaProof. Tay emphasizes the shift to on-policy reinforcement learning (RL), in which a model learns from its own generated outputs and the rewards they receive, much as humans learn from their mistakes. He also highlights the importance of self-consistency and parallel thinking for advanced reasoning, the open problem of data efficiency relative to human learning, and the growing usefulness of AI coding assistants, which he now uses to fix bugs without inspecting the code manually. The discussion also covers the Differentiable Search Index (DSI) and generative retrieval, now deployed at YouTube and Spotify, and the increasing advantage of closed research labs.

Key takeaway

For AI engineers and research scientists working to advance model capabilities: prioritize integrating on-policy RL and self-consistency into your training paradigms. The success of IMO Gold with an end-to-end Gemini model shows that betting on a unified, RL-driven approach can yield significant breakthroughs, even on complex reasoning tasks. Rather than relying solely on imitation learning, consider how your models can learn more efficiently from their own experience and verify their own outputs, in order to push toward more general intelligence and practical utility.

Key insights

On-policy RL and self-consistency are key to advancing LLM reasoning and achieving AGI-like capabilities.

Principles

Models learn best by generating and training on their own rewarded outputs.
Self-consistency and parallel thinking enhance reasoning beyond single-shot inference.
Ideas, not just blind scaling, drive AI progress.

Method

The IMO Gold effort involved training an end-to-end Gemini model for approximately one week, using on-policy RL to generate and self-correct solutions for live competition math problems, leveraging self-consistency for verification.
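To make the on-policy idea concrete, here is a minimal toy sketch in Python. It is not the IMO Gold training setup, which the episode does not describe at this level of detail. It assumes a tiny softmax policy over a handful of discrete actions, a hypothetical `reward_fn` that acts as a verifier, and a plain REINFORCE update with a running baseline. The one property it is meant to show is that every training sample is drawn from the current policy and then scored, rather than copied from a fixed dataset.

```python
import math
import random


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_on_policy(reward_fn, n_actions, steps=2000, lr=0.1, seed=0):
    """Toy REINFORCE loop. All names here are illustrative assumptions."""
    rng = random.Random(seed)
    logits = [0.0] * n_actions  # start from a uniform policy
    baseline = 0.0              # running average reward, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        # On-policy: the sample comes from the *current* policy,
        # not from a fixed set of demonstrations.
        action = rng.choices(range(n_actions), weights=probs)[0]
        reward = reward_fn(action)  # hypothetical verifier / reward signal
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log pi(action) w.r.t. logits is one_hot(action) - probs.
        for a in range(n_actions):
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr * advantage * grad
    return softmax(logits)


if __name__ == "__main__":
    # Assumed toy task: only action 2 is "correct" and earns a reward.
    final_probs = train_on_policy(lambda a: 1.0 if a == 2 else 0.0, n_actions=4)
    print([round(p, 3) for p in final_probs])
```

In this toy, rewarded samples raise the probability of the action that produced them and unrewarded samples lower it, which is the "learning from your own mistakes" loop in miniature. An imitation-learning variant would instead fit the policy to externally supplied correct actions and never sample from itself.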

In practice

Utilize AI coding assistants for automated bug fixing and code generation.
Explore on-policy RL for tasks requiring self-generated trajectories and environmental feedback.
Implement self-consistency techniques like multiple sampling and LM judges for robust reasoning.
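The multiple-sampling form of self-consistency mentioned above can be sketched as a majority vote over independently sampled answers. This is a minimal illustration under stated assumptions, not the method used in Deep Think: `noisy_solver` is a hypothetical stand-in for a stochastic model call, and answers are compared by exact string match. An LM judge would replace the simple vote with a model-based comparison.

```python
import random
from collections import Counter
from typing import Callable, Tuple


def self_consistent_answer(
    sample_fn: Callable[[random.Random], str],
    n_samples: int,
    seed: int = 0,
) -> Tuple[str, float]:
    """Sample several answers and return the most common one with its vote share."""
    rng = random.Random(seed)
    answers = [sample_fn(rng) for _ in range(n_samples)]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / n_samples


def noisy_solver(rng: random.Random) -> str:
    """Hypothetical stand-in for a model: right 60% of the time, else a scattered wrong answer."""
    if rng.random() < 0.6:
        return "42"
    return rng.choice(["41", "43", "44"])


if __name__ == "__main__":
    answer, agreement = self_consistent_answer(noisy_solver, n_samples=101)
    print(answer, round(agreement, 2))
```

The vote helps because wrong answers tend to scatter across many values while correct reasoning paths converge on the same one. The agreement ratio can also serve as a rough confidence signal for deciding when to sample more.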

Topics

Gemini Deep Think
Reinforcement Learning
International Math Olympiad
Generative Retrieval
AI Coding Assistants

Best for: AI Engineer, AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Latent Space: The AI Engineer Podcast.