MaxText Expands Post-Training Capabilities: Introducing SFT and RL on Single-Host TPUs

2026-04-16 · Source: Google Developers Blog - AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, quick

Summary

MaxText, a high-performance framework for large language models, has introduced new post-training capabilities: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on single-host TPU configurations such as v5p-8 and v6e-8. Released on April 16, 2026, these features build on JAX and the Tunix library. SFT lets users fine-tune existing MaxText or Hugging Face checkpoints, including models like Gemma 3, on labeled datasets, with native support for Hugging Face datasets such as ultrachat_200k. For complex reasoning tasks, MaxText now supports the RL algorithms Group Relative Policy Optimization (GRPO) and Group Sequence Policy Optimization (GSPO), using vLLM for high-throughput inference. GRPO saves memory by estimating advantages from a group of sampled responses instead of training a separate value model, while GSPO works at the sequence level to improve training stability on benchmarks like GSM8K.

Key takeaway

For AI engineers and research scientists who build or refine large language models, MaxText's SFT and RL support on single-host TPUs offers a direct path to model specialization and stronger reasoning. You can fine-tune models like Gemma 3 or run RL algorithms such as GRPO and GSPO without a multi-host setup, which shortens iteration and lowers the barrier to post-training.

Key insights

MaxText now enables SFT and advanced RL on single-host TPUs, streamlining LLM post-training.

Principles

Post-training is crucial for LLM specialization.
Memory-efficient RL can run on single-host TPUs.
Sequence-level rewards enhance RL stability.
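
The last two principles can be made concrete with a small sketch. It is not taken from MaxText or Tunix: the function names are hypothetical, and the formulas follow the published descriptions of GRPO (rewards normalized within a group of responses sampled for the same prompt, so no separate value model is kept in memory) and GSPO (one length-normalized importance ratio per response rather than one per token). Details such as the epsilon and the use of the population standard deviation are assumptions.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each response against the other responses to the same prompt.

    rewards: one scalar reward per sampled response in the group.
    The group mean acts as the baseline, replacing a learned value model.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_level_ratios(new_logprobs, old_logprobs):
    """One importance ratio per token (the per-token view used by GRPO)."""
    return [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]


def sequence_level_ratio(new_logprobs, old_logprobs):
    """One length-normalized importance ratio per response (GSPO-style).

    Equivalent to (p_new(y) / p_old(y)) ** (1 / len(y)), computed in log space.
    """
    if not new_logprobs or len(new_logprobs) != len(old_logprobs):
        raise ValueError("log-prob lists must be non-empty and the same length")
    diffs = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return math.exp(sum(diffs) / len(diffs))
```

Averaging the per-token log ratios before exponentiating means a single outlier token moves the sequence-level ratio far less than it moves its own token-level ratio, which is one intuition for the stability claim.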

Method

MaxText supports SFT for instruction following and RL (GRPO and GSPO) for complex reasoning on single-host TPUs. Both build on JAX and Tunix, with vLLM providing inference for RL. Each method has its own installation steps and is launched from the command line.
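
To illustrate what SFT optimizes, here is a minimal sketch of a next-token loss that counts only the response tokens of a prompt/response pair. Masking the prompt is a common SFT convention and an assumption here; the source does not say how MaxText computes its loss, and both function names are hypothetical.

```python
def completion_mask(prompt_len, total_len):
    """Return 0 for prompt positions and 1 for response positions."""
    if not 0 <= prompt_len <= total_len:
        raise ValueError("prompt_len must be between 0 and total_len")
    return [0] * prompt_len + [1] * (total_len - prompt_len)


def masked_next_token_loss(token_logprobs, loss_mask):
    """Average negative log-likelihood over positions where loss_mask is 1.

    token_logprobs: log-probability the model assigned to each target token.
    loss_mask: same length; prompt positions are excluded from the average.
    """
    if len(token_logprobs) != len(loss_mask):
        raise ValueError("token_logprobs and loss_mask must have equal length")
    total = sum(-lp * m for lp, m in zip(token_logprobs, loss_mask))
    count = sum(loss_mask)
    return total / max(count, 1)  # avoid dividing by zero on an empty mask
```

With a mask built this way, the model is trained to produce the labeled response rather than to reproduce the prompt.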

In practice

Fine-tune Gemma 3 with SFT on TPUs.
Use GRPO for memory-constrained RL.
Apply GSPO to improve GSM8K performance.
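
RL on GSM8K needs a reward signal for each sampled response. The sketch below shows one simple, rule-based option: compare the last number in the model's output with the reference answer, which GSM8K stores after a "####" marker at the end of each solution. The function names and the last-number heuristic are assumptions for illustration, not the reward used by MaxText or Tunix.

```python
import re

# Matches integers and decimals, optionally negative, with thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_reference_answer(solution):
    """Return the text after the final '####' marker, without commas."""
    return solution.split("####")[-1].strip().replace(",", "")


def extract_predicted_answer(completion):
    """Return the last number found in the completion, or None if absent."""
    matches = _NUMBER.findall(completion)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def gsm8k_reward(completion, solution):
    """Return 1.0 when the predicted number equals the reference, else 0.0."""
    predicted = extract_predicted_answer(completion)
    if predicted is None:
        return 0.0
    try:
        return 1.0 if float(predicted) == float(extract_reference_answer(solution)) else 0.0
    except ValueError:
        return 0.0  # reference or prediction was not a parseable number
```

A binary reward like this pairs naturally with group-based methods: several responses are sampled per question, and each is scored relative to the others in its group.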

Topics

MaxText
Supervised Fine-Tuning
Reinforcement Learning
Single-Host TPUs
Group Relative Policy Optimization

Best for: AI Engineer, Research Scientist, Machine Learning Engineer, NLP Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Google Developers Blog - AI.