How the community trained Gemma to "Think" with Tunix and TPUs

2026-05-28 · Source: Google Developers Blog - AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, short

Summary

The Google Tunix Hackathon, held on Kaggle, challenged over 11,000 entrants to train Gemma models for general reasoning using Tunix and Kaggle TPUs. Developers turned non-reasoning base models (Gemma-2-2B and Gemma-3-1B) into general reasoning models within a tight compute budget: a Kaggle TPU v5e-8 for 9 hours. Winning techniques combined supervised learning, preference optimization, and reinforcement learning. The first-place entry, G-RaR, paired Supervised Fine-Tuning (SFT) with GRPO (Group Relative Policy Optimization) and a rubric-based LLM-as-judge reward system that used Gemma-3-12B as the judge model. The second-place entry, Pinocchio-1B, trained a 1B-parameter model through SFT, SimPO, and GRPO, extending Tunix with a custom loss and asynchronous evaluation. The third-place entry, IDEA-E, distilled an ethical reasoning framework into a 2B model using curriculum-guided GRPO and a TF-IDF reward. Other approaches included on-policy distillation and custom dataset curation for domain-specific reasoning in medicine, chemistry, law, and robotics.

Key takeaway

AI engineers building reasoning capabilities into smaller LLMs should explore multi-stage post-training pipelines that combine SFT, preference optimization, and GRPO. Tunix and Kaggle TPUs can deliver strong results even on a limited compute budget, as the hackathon winners demonstrated. Consider implementing custom reward functions or LLM-as-judge systems to provide dense feedback and refine reasoning logic for your specific application.
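As an illustration of the preference-optimization stage, here is a minimal sketch of the SimPO objective for a single preference pair, written in plain Python rather than against the Tunix API. SimPO compares the length-normalized log-probabilities of a chosen and a rejected response and requires them to differ by a target margin. The function name, argument names, and the default values of `beta` and `gamma` are illustrative assumptions, not settings taken from any hackathon entry.

```python
import math


def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO loss for one (chosen, rejected) pair.

    logp_* are summed token log-probabilities under the policy and len_* are
    response lengths in tokens. beta scales the length-normalized reward and
    gamma is the target margin; both defaults are placeholder values.
    """
    if len_chosen <= 0 or len_rejected <= 0:
        raise ValueError("response lengths must be positive")
    # Length-normalized implicit rewards, minus the target margin.
    margin = beta * (logp_chosen / len_chosen
                     - logp_rejected / len_rejected) - gamma
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Chosen response averages -1.0 per token, rejected averages -3.0 per token.
good_pair = simpo_loss(-10.0, 10, -30.0, 10)
# Same pair with the preference labels swapped.
bad_pair = simpo_loss(-30.0, 10, -10.0, 10)
```

The loss is small when the chosen response is already more likely per token than the rejected one by more than the margin, and grows roughly linearly when the ordering is reversed. In a real pipeline this would be computed over batches in JAX with gradients flowing through the log-probabilities.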

Key insights

The hackathon community showed that small LLMs can be trained for general reasoning with limited compute.

Principles

LLM-as-judge provides dense, smooth feedback.
Sequential training stages refine reasoning.
Custom reward functions enhance logic.
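To make the first principle concrete, the sketch below shows one way rubric scores from a judge model could be collapsed into a single scalar reward. It is an assumption-laden illustration: the rubric criteria, their weights, and the `judge_scores` dictionary are hypothetical, and the call to an actual judge model is left out entirely. The feedback is dense in the sense that partially correct answers receive graded credit instead of a pass/fail signal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    name: str
    weight: float  # relative importance; hypothetical values below


def rubric_reward(scores, rubric):
    """Weighted average of per-criterion judge scores, each clamped to [0, 1]."""
    total_weight = sum(item.weight for item in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    weighted = 0.0
    for item in rubric:
        # A criterion the judge did not score counts as 0.
        score = min(1.0, max(0.0, scores.get(item.name, 0.0)))
        weighted += item.weight * score
    return weighted / total_weight


# Hypothetical rubric; a real one would be tailored to the task.
RUBRIC = [
    RubricItem("correctness", 0.5),
    RubricItem("reasoning_clarity", 0.3),
    RubricItem("format", 0.2),
]

# Stand-in for scores parsed from a judge model's response.
judge_scores = {"correctness": 1.0, "reasoning_clarity": 0.5, "format": 1.0}
reward = rubric_reward(judge_scores, RUBRIC)
```

Clamping and defaulting missing criteria to zero keeps the reward bounded even when the judge's output is malformed, which matters when the value feeds directly into a policy-gradient update.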

Method

Winning methods generally followed a multi-stage post-training pipeline: Supervised Fine-Tuning (SFT) to establish a baseline, then GRPO or SimPO for refinement, often with custom reward functions or judge models.

In practice

Utilize Tunix with Kaggle TPUs.
Implement multi-stage training pipelines.
Integrate custom reward functions.
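As one example of a custom reward function, the sketch below scores a completion by its TF-IDF cosine similarity to a reference text. The summary does not describe how the third-place entry constructed its TF-IDF reward, so everything here is an assumption for illustration: the tokenizer, the smoothed IDF formula, the function names, and the optional `background` corpus are all hypothetical choices.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors using smoothed IDF: ln((1+n)/(1+df)) + 1."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        term_freq = Counter(toks)
        vectors.append({
            tok: count * (math.log((1 + n) / (1 + doc_freq[tok])) + 1.0)
            for tok, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def tfidf_reward(completion, reference, background=()):
    """Reward in [0, 1]: TF-IDF cosine similarity of completion and reference.

    background is an optional collection of extra documents that only
    influence the IDF statistics.
    """
    vectors = tfidf_vectors([completion, reference, *background])
    return cosine(vectors[0], vectors[1])


reference = "Weigh the harms and benefits to every stakeholder before acting."
close = tfidf_reward("Consider harms and benefits to each stakeholder first.", reference)
unrelated = tfidf_reward("Bananas are yellow.", reference)
```

A lexical reward like this is cheap to compute on every rollout, but it only measures word overlap, so it is usually paired with other signals such as format checks or a judge model.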

Topics

Gemma Models
Chain-of-Thought Reasoning
Tunix Framework
Reinforcement Learning
Supervised Fine-Tuning
Kaggle TPUs

Code references

google/tunix

Best for: AI Engineer, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Google Developers Blog - AI.