Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives
Summary
A new perplexity-based method effectively identifies finetuning objectives in large language model (LLM) "model organisms" (MOs), which are finetuned to exhibit specific behaviors for AI safety research. The technique, called perplexity differencing, exploits the tendency of finetuned models to overgeneralize their learned behaviors beyond intended contexts. It involves generating diverse completions from the finetuned model using short, random prefills, then ranking these completions by the perplexity gap between a reference model and the finetuned model. The top-ranked completions often reveal the finetuning objectives, including hidden behaviors, false facts, and emergent misalignment, without requiring access to model internals or prior assumptions. Evaluated on 76 MOs ranging from 0.5B to 70B parameters, the method successfully surfaced objectives for the vast majority, performing particularly well on models trained via synthetic document finetuning or to produce exact phrases. The approach also works with cross-family reference models, broadening its applicability to API-gated models that expose token logprobs.
Key takeaway
For AI safety researchers and auditors evaluating LLMs for hidden or unintended behaviors, perplexity differencing offers a practical, non-invasive tool. By inspecting the completions with the largest perplexity gap between a reference model and the finetuned model, you can often identify finetuning objectives, including emergent misalignment, even in API-gated models, provided the API exposes token logprobs. The technique is particularly useful for auditing model organisms and for understanding how far finetuned behaviors overgeneralize, which can inform more robust safety training and detection strategies.
Key insights
Perplexity differencing reveals finetuning objectives in LLMs by exploiting their overgeneralization of learned behaviors.
Principles
- Finetuning induces behavioral overgeneralization.
- Perplexity differences highlight novel, learned behaviors.
- Emergent behaviors can be detected beyond memorized data.
Method
Generate diverse completions from the finetuned model using short, random prefills, then rank them by the perplexity gap between a reference model and the finetuned model. The top-ranked completions often surface the finetuning objectives.
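The ranking step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes you already have per-token natural-log probabilities for each completion from both models, and it assumes the gap is the reference perplexity minus the finetuned perplexity, so that text the finetuned model finds easy and the reference model finds surprising ranks first. The names `perplexity`, `rank_by_perplexity_gap`, and `ScoredCompletion` are hypothetical.

```python
import math
from typing import List, NamedTuple, Sequence


class ScoredCompletion(NamedTuple):
    text: str
    ppl_reference: float
    ppl_finetuned: float
    gap: float  # ppl_reference - ppl_finetuned (assumed sign convention)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if len(token_logprobs) == 0:
        raise ValueError("need at least one token logprob")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_by_perplexity_gap(
    texts: Sequence[str],
    reference_logprobs: Sequence[Sequence[float]],
    finetuned_logprobs: Sequence[Sequence[float]],
    top_k: int = 10,
) -> List[ScoredCompletion]:
    """Score each completion under both models and return the top_k largest gaps."""
    if not (len(texts) == len(reference_logprobs) == len(finetuned_logprobs)):
        raise ValueError("inputs must have the same length")
    scored = []
    for text, ref_lp, ft_lp in zip(texts, reference_logprobs, finetuned_logprobs):
        ppl_ref = perplexity(ref_lp)
        ppl_ft = perplexity(ft_lp)
        scored.append(ScoredCompletion(text, ppl_ref, ppl_ft, ppl_ref - ppl_ft))
    # Largest gap first: most surprising to the reference, least to the finetuned model.
    scored.sort(key=lambda s: s.gap, reverse=True)
    return scored[:top_k]
```

In use, `texts` would hold completions sampled from the finetuned model after short random prefills, and the two logprob lists would come from scoring those same completions under each model. A human or an LLM judge then reads the top of the ranked list.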
In practice
- Use perplexity differencing to audit API-gated LLMs.
- Use a cross-family model as the reference when the base model is unavailable.
- Analyze top-ranked completions for hidden or emergent behaviors.
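When the reference model comes from a different family, the two models usually tokenize the same text into different numbers of tokens, so per-token perplexities are not directly comparable. One common workaround, shown here as an assumption and not as the paper's stated procedure, is to normalize the total negative log-likelihood by character count instead of token count. The function names below are hypothetical, and the token logprobs are assumed to be natural logs as returned by a logprob-exposing API.

```python
from typing import Sequence


def nll_per_char(token_logprobs: Sequence[float], text: str) -> float:
    """Total negative log-likelihood of `text`, divided by its character count.

    Dividing by characters (not tokens) makes scores from models with
    different tokenizers comparable. This normalization is an assumption.
    """
    if not text:
        raise ValueError("text must be non-empty")
    return -sum(token_logprobs) / len(text)


def cross_family_gap(
    text: str,
    reference_token_logprobs: Sequence[float],
    finetuned_token_logprobs: Sequence[float],
) -> float:
    """Positive when the finetuned model finds `text` more likely than the reference."""
    return nll_per_char(reference_token_logprobs, text) - nll_per_char(
        finetuned_token_logprobs, text
    )
```

Completions can then be sorted by `cross_family_gap` in descending order, exactly as with a same-family reference.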
Topics
- Perplexity Differencing
- Model Organisms
- LLM Finetuning Objectives
- Emergent Misalignment
- AI Safety Auditing
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.