Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives

2025-09-15 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

Perplexity differencing is a perplexity-based method for identifying the finetuning objectives of large language model (LLM) "model organisms" (MOs): models finetuned to exhibit specific behaviors for AI safety research. The technique exploits the tendency of finetuned models to overgeneralize learned behaviors beyond their intended contexts. It generates diverse completions from the finetuned model using short, random prefills, then ranks those completions by the perplexity gap between a reference model and the finetuned model. The top-ranked completions often reveal the finetuning objective, including hidden behaviors, false facts, and emergent misalignment, without access to model internals or prior assumptions about what was trained. Evaluated on 76 MOs ranging from 0.5B to 70B parameters, the method surfaced the objective for the vast majority, and performed particularly well on models trained via synthetic document finetuning or trained to produce exact phrases. It also works with cross-family reference models, which extends it to API-gated models that expose token logprobs.

Key takeaway

For AI safety researchers and auditors evaluating LLMs for hidden or unintended behaviors, perplexity differencing offers a practical, non-invasive tool. Inspecting the completions with the largest perplexity gap often reveals the finetuning objective, including emergent misalignment, even for API-gated models that expose token logprobs. The technique is particularly useful for auditing model organisms and for gauging how far a trained behavior overgeneralizes, which can inform more robust safety training and detection strategies.

Key insights

Perplexity differencing reveals finetuning objectives in LLMs by exploiting their overgeneralization of learned behaviors.

Principles

Finetuning induces behavioral overgeneralization.
Perplexity differences highlight novel, learned behaviors.
Emergent behaviors can be detected beyond memorized data.

Method

Generate diverse completions from the finetuned model using short, random prefills, then rank them by the perplexity gap between a reference model and the finetuned model. The top-ranked completions tend to surface the finetuning objective.
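
The ranking step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes per-token logprobs for each completion have already been collected from both models, that both models share a tokenizer, and that the gap is the reference model's log-perplexity minus the finetuned model's. The names `log_perplexity`, `rank_by_gap`, and `ScoredCompletion` are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Tuple


class ScoredCompletion(NamedTuple):
    text: str
    gap: float  # log-perplexity(reference) - log-perplexity(finetuned)


def log_perplexity(token_logprobs: List[float]) -> float:
    """Mean negative log-likelihood per token (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return -sum(token_logprobs) / len(token_logprobs)


def rank_by_gap(
    samples: Iterable[Tuple[str, List[float], List[float]]]
) -> List[ScoredCompletion]:
    """Rank completions by perplexity gap, largest first.

    Each sample is (text, finetuned_logprobs, reference_logprobs).
    Assumption: a large positive gap means the text is far more
    likely under the finetuned model than under the reference.
    """
    scored = [
        ScoredCompletion(text, log_perplexity(ref) - log_perplexity(ft))
        for text, ft, ref in samples
    ]
    return sorted(scored, key=lambda s: s.gap, reverse=True)
```

A completion that the finetuned model finds easy but the reference model finds surprising rises to the top of the list, which is where the summary says the finetuning objective tends to show up.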

In practice

Use perplexity differencing to audit API-gated LLMs that expose token logprobs.
Employ a cross-family model as the reference when the base model is unavailable.
Analyze top-ranked completions for hidden or emergent behaviors.
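
With a cross-family reference, the two models usually tokenize the same text differently, so per-token averages are not directly comparable. The summary does not say how this mismatch is handled; one common workaround, shown here purely as an assumption, is to normalize total negative log-likelihood by character count instead of token count. The function names are hypothetical.

```python
from typing import List


def nll_per_char(text: str, token_logprobs: List[float]) -> float:
    """Total negative log-likelihood divided by the number of characters.

    Assumption: token_logprobs covers exactly the tokens that make up
    `text` under the scoring model's own tokenizer.
    """
    if not text:
        raise ValueError("text must be non-empty")
    return -sum(token_logprobs) / len(text)


def cross_family_gap(
    text: str,
    finetuned_logprobs: List[float],
    reference_logprobs: List[float],
) -> float:
    """Tokenizer-agnostic gap: reference NLL/char minus finetuned NLL/char."""
    return nll_per_char(text, reference_logprobs) - nll_per_char(
        text, finetuned_logprobs
    )
```

The two logprob lists may have different lengths, since each comes from a different tokenizer; only the shared text length is used for normalization.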

Topics

Perplexity Differencing
Model Organisms
LLM Finetuning Objectives
Emergent Misalignment
AI Safety Auditing

Code references

safety-research/petri

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.