How the community trained Gemma to "Think" with Tunix and TPUs
Summary
The Google Tunix Hackathon, held on Kaggle, challenged over 11,000 entrants to train Gemma models for general reasoning using Tunix and Kaggle TPUs. Developers transformed non-reasoning base models (Gemma-2-2B and Gemma-3-1B) into general reasoning models with limited compute (a Kaggle TPU v5e-8 for 9 hours). Winning techniques combined supervised learning, preference optimization, and reinforcement learning. The first-place entry, G-RaR, used Supervised Fine-Tuning (SFT) with GRPO and a rubric-based LLM-as-judge reward system, employing a Gemma-3-12B judge model. The second-place entry, Pinocchio-1B, evolved a 1B-parameter model via SFT, SimPO, and GRPO, extending Tunix with a custom loss and asynchronous evaluation. The third-place entry, IDEA-E, distilled an ethical reasoning framework into a 2B model using curriculum-guided GRPO and a TF-IDF reward. Other approaches included on-policy distillation and custom dataset curation for domain-specific reasoning in medicine, chemistry, law, and robotics.
Key takeaway
AI engineers building reasoning capabilities into smaller LLMs should explore multi-stage post-training pipelines that combine SFT, preference optimization, and GRPO. The hackathon winners showed that Tunix and Kaggle TPUs can deliver strong results even on a limited compute budget. Custom reward functions or LLM-as-judge systems can provide dense feedback and refine reasoning logic for a specific application.
Key insights
Hackathon participants showed that small Gemma models (1B to 2B parameters) can be post-trained for general reasoning within a limited compute budget of 9 hours on a Kaggle TPU v5e-8.
Principles
- LLM-as-judge provides dense, smooth feedback.
- Sequential training stages refine reasoning.
- Custom reward functions refine the model's reasoning logic.
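To illustrate why a rubric-based judge can give denser feedback than a pass/fail check, here is a minimal sketch of how per-criterion judge scores might be combined into one scalar reward. The rubric criteria, their weights, and the `rubric_reward` helper are hypothetical and are not taken from the winning entry; in a real setup the scores would come from a judge model rather than being passed in by hand.

```python
from typing import Dict

# Hypothetical rubric: criterion name -> weight (an assumption for illustration).
RUBRIC: Dict[str, float] = {
    "correct_final_answer": 0.5,
    "reasoning_is_coherent": 0.3,
    "follows_output_format": 0.2,
}


def rubric_reward(scores: Dict[str, float], rubric: Dict[str, float] = RUBRIC) -> float:
    """Combine per-criterion judge scores (each in [0, 1]) into one reward in [0, 1]."""
    total_weight = sum(rubric.values())
    weighted = 0.0
    for criterion, weight in rubric.items():
        # A criterion the judge did not score counts as 0; scores are clamped to [0, 1].
        score = min(1.0, max(0.0, scores.get(criterion, 0.0)))
        weighted += weight * score
    return weighted / total_weight
```

With this shape, a response that reasons coherently and follows the format but gets the final answer wrong still receives partial credit instead of a flat zero, which is one way a judge-based reward can be smoother than exact-match scoring.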
Method
Winning methods generally involved a multi-stage post-training pipeline: Supervised Fine-Tuning (SFT) to establish a baseline, followed by GRPO or SimPO for refinement, often with custom reward functions or judge models.
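The GRPO stage scores several sampled completions for the same prompt and compares each one against its own group rather than against a learned value function. The sketch below shows only that group-relative normalization step, under the assumption that rewards are already computed. It is not the Tunix API, and the function name `group_relative_advantages` is illustrative.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of sampled completions.

    Each advantage is (reward - group mean) / (group std + eps). This sketch
    uses the population standard deviation; implementations may differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When all completions in a group receive the same reward, every advantage is zero and the group contributes no learning signal, which is one reason graded rewards can be more useful than binary ones at this stage.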
In practice
- Utilize Tunix with Kaggle TPUs.
- Implement multi-stage training pipelines.
- Integrate custom reward functions.
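As one example of a custom reward function, the summary mentions a TF-IDF reward in the third-place entry. How that entry computed it is not described here, so the following is a generic sketch: it scores a completion by the cosine similarity between its TF-IDF vector and that of a reference text. The whitespace tokenizer, the smoothed IDF formula, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF so terms shared by all documents keep a nonzero weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # An empty text has no similarity to anything.
    return dot / (norm_a * norm_b)


def tfidf_reward(completion: str, reference: str) -> float:
    """Reward in [0, 1]: TF-IDF cosine similarity between completion and reference."""
    vec_completion, vec_reference = tfidf_vectors([completion, reference])
    return cosine(vec_completion, vec_reference)
```

A lexical reward like this is cheap to compute inside a training loop, but it only measures word overlap, so it would likely need to be paired with other signals to reward correct reasoning rather than similar wording.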
Topics
- Gemma Models
- Chain-of-Thought Reasoning
- Tunix Framework
- Reinforcement Learning
- Supervised Fine-Tuning
- Kaggle TPUs
Best for: AI Engineer, Machine Learning Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Google Developers Blog - AI.