SFT Drives Gemini’s Safety Properties

· Source: AI Alignment Forum · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

Google DeepMind's Language Model Interpretability team found that most safety-relevant properties of Gemini models, specifically Gemini 3.1 Pro and Gemini 3 Flash, stem primarily from the combination of pretraining and Supervised Fine-Tuning (SFT) rather than from other training stages such as Reinforcement Learning (RL). This unexpected finding emerged from an experiment in which SFT was applied to pretraining-only versions of both models. The resulting SFT-only models were then compared against their production counterparts on several safety evaluations: ODCV, modified Petri v2 alignment evaluations, general safety evaluations for over-refusal and unsafe responses, a reward hacking environment, and an analysis of 50,000 anonymized free-tier user logs. Performance was "remarkably similar" between the SFT-only and production models across all evaluations, which suggests that SFT is a high-leverage point for safety interventions in Gemini.

Key takeaway

For machine learning engineers working on large language model safety, this research suggests treating Supervised Fine-Tuning (SFT) as a primary intervention point for models like Gemini. SFT, combined with pretraining, appears to drive most safety properties, so later stages such as Reinforcement Learning may contribute comparatively little to them. Refining SFT data and processes is therefore likely to be an efficient route to the desired safety outcomes.

Key insights

Gemini's safety primarily derives from pretraining and SFT, not RL, making SFT a key intervention point.

Method

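The team performed SFT on pretraining-only versions of Gemini 3.1 Pro and Gemini 3 Flash, then compared these SFT-only models to the production versions on ODCV, modified Petri v2 alignment evaluations, general safety evaluations (over-refusal and unsafe responses), a reward hacking environment, and an analysis of 50,000 anonymized free-tier user logs.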
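
The comparison step described above can be illustrated with a small sketch. The source does not publish its evaluation harness or any numeric scores, so everything here is an assumption made for illustration: the names `EvalGap` and `compare_variants`, the 0.05 tolerance, and the placeholder evaluation names and scores in the usage lines. It only shows one plausible way to check whether two model variants score similarly on the evaluations they share.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalGap:
    """Scores of two model variants on one evaluation (hypothetical structure)."""
    name: str
    sft_only: float
    production: float

    @property
    def abs_gap(self) -> float:
        # Absolute difference between the two variants' scores.
        return abs(self.sft_only - self.production)


def compare_variants(sft_only, production, tolerance=0.05):
    """Split shared evaluations into 'similar' and 'divergent' lists.

    `sft_only` and `production` map evaluation name -> score.
    `tolerance` is an assumed threshold, not a value from the source.
    """
    shared = sorted(set(sft_only) & set(production))
    gaps = [EvalGap(name, sft_only[name], production[name]) for name in shared]
    similar = [g.name for g in gaps if g.abs_gap <= tolerance]
    divergent = [g.name for g in gaps if g.abs_gap > tolerance]
    return gaps, similar, divergent


# Placeholder scores for illustration only; these are NOT results from the source.
placeholder_sft = {"eval_a": 0.90, "eval_b": 0.40}
placeholder_prod = {"eval_a": 0.92, "eval_b": 0.70}
gaps, similar, divergent = compare_variants(placeholder_sft, placeholder_prod)
print("similar:", similar)
print("divergent:", divergent)
```

In a real study, a fixed tolerance would likely be replaced by a per-evaluation significance test or confidence interval, since different evaluations have different scales and noise levels.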

Best for: Research Scientist, NLP Engineer, AI Scientist, AI Ethicist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.