SFT Drives Gemini’s Safety Properties
Summary
Google DeepMind's Language Model Interpretability team found that most safety-relevant properties of Gemini models, specifically Gemini 3.1 Pro and Gemini 3 Flash, stem from the combination of pretraining and Supervised Fine-Tuning (SFT) rather than from other training stages such as Reinforcement Learning (RL). This unexpected finding emerged from an experiment in which the team applied SFT to pretraining-only versions of Gemini 3.1 Pro and Gemini 3 Flash, then compared the resulting SFT-only models against their production counterparts. The comparison covered ODCV, modified Petri v2 alignment evaluations, general safety evaluations for over-refusal and unsafe responses, a reward hacking environment, and an analysis of 50,000 anonymized free-tier user logs. Performance was "remarkably similar" between the SFT-only and production models across all evaluations, which suggests that SFT is a high-leverage point for safety interventions in Gemini.
Key takeaway
For Machine Learning Engineers working on large language model safety, this research suggests prioritizing Supervised Fine-Tuning (SFT) as a primary intervention point for models like Gemini. SFT, combined with pretraining, appears to drive most safety properties, so later stages such as Reinforcement Learning may contribute less to safety than expected. Refining SFT data and processes is therefore likely to be an efficient route to the desired safety outcomes.
Key insights
Gemini's safety primarily derives from pretraining and SFT, not RL, making SFT a key intervention point.
Principles
- SFT significantly shapes model safety.
- Training stages have distinct safety impacts.
- Similar benchmark results between SFT-only and production models suggest that later training stages add little to safety behavior.
Method
The team performed SFT on pretraining-only versions of Gemini 3.1 Pro and Gemini 3 Flash, then compared the resulting SFT-only models to the production versions on ODCV, alignment evaluations, general safety evaluations, a reward hacking environment, and anonymized user logs.
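The comparison step can be illustrated with a minimal sketch. It is not the team's code: the function name `compare_models`, the `tolerance` threshold, the benchmark keys, and every score below are hypothetical placeholders chosen only to show the shape of a per-benchmark comparison between an SFT-only model and a production model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    """One benchmark's scores for the two model variants."""
    benchmark: str
    sft_only: float
    production: float

    @property
    def gap(self) -> float:
        # Absolute difference between the two variants on this benchmark.
        return abs(self.sft_only - self.production)


def compare_models(sft_only, production, tolerance=0.05):
    """Split shared benchmarks into 'similar' and 'divergent' by score gap.

    `tolerance` is an assumed cutoff for illustration, not a value from the source.
    """
    shared = sorted(set(sft_only) & set(production))
    rows = [Comparison(name, sft_only[name], production[name]) for name in shared]
    similar = [r.benchmark for r in rows if r.gap <= tolerance]
    divergent = [r.benchmark for r in rows if r.gap > tolerance]
    return rows, similar, divergent


# Placeholder scores, invented for illustration; these are not reported results.
sft_scores = {"eval_a": 0.91, "eval_b": 0.78, "eval_c": 0.60}
prod_scores = {"eval_a": 0.93, "eval_b": 0.80, "eval_c": 0.75}

rows, similar, divergent = compare_models(sft_scores, prod_scores)
for row in rows:
    print(f"{row.benchmark}: sft_only={row.sft_only:.2f} "
          f"production={row.production:.2f} gap={row.gap:.2f}")
```

A single scalar gap per benchmark is a simplification. Evaluations such as user log analysis would likely need a richer comparison than one number per model.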
In practice
- Focus safety interventions on SFT.
- Evaluate SFT impact using diverse benchmarks.
- Analyze anonymized user logs to check safety behavior in real-world use.
Topics
- Gemini
- Supervised Fine-Tuning
- Large Language Model Safety
- Model Interpretability
- Reinforcement Learning
- LLM Benchmarking
Best for: Research Scientist, NLP Engineer, AI Scientist, AI Ethicist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.