MaxText Expands Post-Training Capabilities: Introducing SFT and RL on Single-Host TPUs
Summary
MaxText, a high-performance framework for large language models, has added two post-training capabilities: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on single-host TPU configurations such as v5p-8 and v6e-8. Released on April 16, 2026, these features build on JAX and the Tunix library. SFT lets users fine-tune existing MaxText or Hugging Face checkpoints, including models such as Gemma 3, on labeled datasets, with native support for Hugging Face datasets such as ultrachat_200k. For complex reasoning tasks, MaxText now supports two RL algorithms, Group Relative Policy Optimization (GRPO) and Group Sequence Policy Optimization (GSPO), and uses vLLM for high-throughput inference. GRPO is memory-efficient, while GSPO improves training stability on reasoning tasks such as GSM8K.
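One way to see why GRPO is described as memory-efficient: as the algorithm was originally formulated, each sampled completion is scored against the other completions for the same prompt, so no separate value (critic) model has to be held in memory. The sketch below shows only that group-relative normalization step in plain Python. It is not MaxText or Tunix code; the function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for a single prompt.

    Hypothetical helper for illustration; not part of MaxText or Tunix.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt: two correct (1.0), two wrong (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat the group average get a positive advantage and the rest get a negative one. A group in which every completion earns the same reward produces all-zero advantages and therefore contributes no learning signal.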
Key takeaway
For AI engineers and research scientists who develop or refine large language models, MaxText's new SFT and RL capabilities on single-host TPUs offer a direct path to better model specialization and reasoning. You can now fine-tune models such as Gemma 3, or run RL algorithms such as GRPO and GSPO, without a multi-host setup, which makes post-training more accessible.
Key insights
MaxText now enables SFT and advanced RL on single-host TPUs, streamlining LLM post-training.
Principles
- Post-training is crucial for LLM specialization.
- Memory-efficient RL can run on single-host TPUs.
- Sequence-level rewards enhance RL stability.
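To make the sequence-level idea concrete: in the GSPO paper's formulation, the importance ratio between the new and old policy is computed once per sequence and length-normalized, rather than once per token. The sketch below contrasts the two in plain Python. It illustrates the published formula only and is not the MaxText or Tunix implementation; the function names are hypothetical and `clip_eps=0.2` is a placeholder value, not a recommended setting.

```python
import math


def token_level_ratios(new_logps, old_logps):
    """Per-token importance ratios pi_new / pi_old."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_level_ratio(new_logps, old_logps):
    """Length-normalized sequence ratio: exp(mean per-token log-ratio)."""
    if not new_logps or len(new_logps) != len(old_logps):
        raise ValueError("log-prob lists must be non-empty and equal length")
    mean_diff = sum(n - o for n, o in zip(new_logps, old_logps)) / len(new_logps)
    return math.exp(mean_diff)


def clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate applied to a single ratio."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


# Per-token log-probabilities of one sampled completion (made-up numbers).
new_logps = [-0.5, -1.0, -3.0]
old_logps = [-0.6, -1.0, -1.5]

print(token_level_ratios(new_logps, old_logps))
print(sequence_level_ratio(new_logps, old_logps))
```

The sequence-level ratio is the geometric mean of the token-level ratios, so a single outlier token moves it far less than it moves that token's own ratio. All tokens in the completion then share one clipping decision.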
Method
MaxText supports SFT for instruction following and RL (GRPO or GSPO) for complex reasoning. Both run on single-host TPUs using JAX and Tunix, with vLLM handling inference during RL. Each method has its own installation steps and command-line invocation.
In practice
- Fine-tune Gemma 3 with SFT on TPUs.
- Use GRPO for memory-constrained RL.
- Apply GSPO to improve GSM8K performance.
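RL on a benchmark like GSM8K needs a reward signal for each sampled completion. The sketch below shows one simple possibility, a binary correctness reward, in plain Python. It relies on the fact that GSM8K reference solutions end with a line of the form `#### <answer>`; everything else (the function names, and the choice to treat the last number in the completion as the model's answer) is an assumption made for illustration and is not the reward function MaxText or Tunix ships.

```python
import re

# Matches integers and decimals, optionally negative, with thousands commas.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution):
    """Extract the final answer that follows '####' in a GSM8K solution."""
    return solution.split("####")[-1].strip().replace(",", "")


def last_number(text):
    """Return the last number in the text, or None if there is none."""
    matches = _NUMBER.findall(text)
    return matches[-1].replace(",", "") if matches else None


def correctness_reward(completion, solution):
    """1.0 if the completion's last number equals the reference, else 0.0."""
    predicted = last_number(completion)
    if predicted is None:
        return 0.0
    try:
        return 1.0 if float(predicted) == float(reference_answer(solution)) else 0.0
    except ValueError:
        # Reference or prediction could not be parsed as a number.
        return 0.0


solution = "16 - 3 - 4 = 9\n9 * 2 = 18\n#### 18"
print(correctness_reward("She sells 9 eggs at $2 each, so she makes $18.", solution))
print(correctness_reward("The answer is 20.", solution))
```

Rewards like this, computed for a group of completions per prompt, are what a group-based algorithm such as GRPO or GSPO would turn into advantages. Real setups often add further terms, for example a format reward, which this sketch omits.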
Topics
- MaxText
- Supervised Fine-Tuning
- Reinforcement Learning
- Single-Host TPUs
- Group Relative Policy Optimization
Best for: AI Engineer, Research Scientist, Machine Learning Engineer, NLP Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Google Developers Blog - AI.