SFT Drives Gemini’s Safety Properties

2026-06-13 · Source: AI Alignment Forum · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

Google DeepMind's Language Model Interpretability team found that most safety-relevant properties of Gemini models, specifically Gemini 3.1 Pro and Gemini 3 Flash, stem primarily from the combination of pretraining and Supervised Fine-Tuning (SFT) rather than from other training stages such as Reinforcement Learning (RL). This unexpected finding came from an experiment in which SFT was applied to pretraining-only versions of Gemini 3.1 Pro and Gemini 3 Flash. The resulting SFT-only models were then compared against their production counterparts on a range of safety evaluations: ODCV, modified Petri v2 alignment evaluations, general safety evaluations for over-refusal and unsafe responses, a reward hacking environment, and an analysis of 50,000 anonymized free-tier user logs. The results showed "remarkably similar" performance between SFT-only and production models across all evaluations, which suggests that SFT is a high-leverage point for safety interventions in Gemini.

Key takeaway

For machine learning engineers working on large language model safety, this research suggests prioritizing Supervised Fine-Tuning (SFT) as a primary intervention point for models like Gemini. When designing safety mechanisms, keep in mind that SFT combined with pretraining appears to drive most safety properties, so later stages such as Reinforcement Learning may contribute comparatively little to them. Refining SFT data and processes is therefore likely to be an efficient route to the desired safety outcomes.

Key insights

Gemini's safety primarily derives from pretraining and SFT, not RL, making SFT a key intervention point.

Principles

SFT significantly shapes model safety.
Training stages have distinct safety impacts.
Similar benchmark results for SFT-only and production models indicate which training stage accounts for a safety property.

Method

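The team applied SFT to pretraining-only versions of Gemini 3.1 Pro and Gemini 3 Flash, then compared the resulting SFT-only models with the production versions on ODCV, modified Petri v2 alignment evaluations, general safety evaluations, a reward hacking environment, and an analysis of anonymized user logs.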
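The comparison described above can be pictured as a per-benchmark gap check between two model variants. The following is a minimal sketch of that idea, not the team's actual tooling: the function name `compare_stages`, the `BenchmarkGap` record, the `tolerance` threshold, and the benchmark names and scores in the example are all hypothetical placeholders, not figures from the original article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkGap:
    """Score of each model variant on one benchmark (hypothetical record type)."""
    benchmark: str
    sft_only: float
    production: float

    @property
    def abs_gap(self) -> float:
        # Absolute difference between the two variants on this benchmark.
        return abs(self.sft_only - self.production)


def compare_stages(sft_only: dict, production: dict, tolerance: float = 0.05):
    """Compare two score dictionaries on the benchmarks they share.

    Returns the per-benchmark gaps (sorted by benchmark name) and a flag
    that is True when every gap is within `tolerance`. The tolerance is an
    assumption of this sketch, not a threshold from the source.
    """
    shared = sorted(set(sft_only) & set(production))
    gaps = [BenchmarkGap(name, sft_only[name], production[name]) for name in shared]
    similar = all(gap.abs_gap <= tolerance for gap in gaps)
    return gaps, similar


if __name__ == "__main__":
    # Placeholder scores for illustration only; these are NOT reported results.
    sft_scores = {"benchmark_a": 0.91, "benchmark_b": 0.12}
    prod_scores = {"benchmark_a": 0.93, "benchmark_b": 0.10}
    gaps, similar = compare_stages(sft_scores, prod_scores)
    for gap in gaps:
        print(f"{gap.benchmark}: gap={gap.abs_gap:.3f}")
    print("similar within tolerance:", similar)
```

A single tolerance across benchmarks is a simplification: metrics on different scales would need per-benchmark thresholds or a statistical test in any serious analysis.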

In practice

Focus safety interventions on SFT.
Evaluate SFT impact using diverse benchmarks.
Analyze user logs for real-world safety.
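As one illustration of the log-analysis step, over-refusal and unsafe-response rates can be computed from labeled records. This is a sketch under stated assumptions: the `LabeledResponse` fields, the `safety_rates` function, and the metric definitions (over-refusal as refusals among benign prompts, unsafe rate as unsafe responses among all records) are hypothetical and are not taken from the original article.

```python
from typing import Iterable, NamedTuple


class LabeledResponse(NamedTuple):
    """One labeled interaction (hypothetical schema)."""
    prompt_is_benign: bool
    refused: bool
    response_is_unsafe: bool


def safety_rates(records: Iterable[LabeledResponse]) -> dict:
    """Compute two simple safety rates from labeled records.

    over_refusal_rate: share of benign prompts that were refused.
    unsafe_response_rate: share of all responses labeled unsafe.
    Both default to 0.0 when their denominator is empty.
    """
    records = list(records)
    benign = [r for r in records if r.prompt_is_benign]
    over_refusal = sum(r.refused for r in benign) / len(benign) if benign else 0.0
    unsafe = sum(r.response_is_unsafe for r in records) / len(records) if records else 0.0
    return {"over_refusal_rate": over_refusal, "unsafe_response_rate": unsafe}
```

Running the same function over logs from two model variants would give directly comparable rates, provided both sets of logs were labeled with the same procedure.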

Topics

Gemini
Supervised Fine-Tuning
Large Language Model Safety
Model Interpretability
Reinforcement Learning
LLM Benchmarking

Best for: Research Scientist, NLP Engineer, AI Scientist, AI Ethicist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.