Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance
Summary
This research characterizes and surgically removes internal traces of memorization in large language models (LLMs) that persist after behavioral unlearning. The study introduces a "leave-one-out cross-sequence" (LOO) probe that detects a memorization signature generalizing to held-out sequences, and reports consistent positive gaps of +0.32, +0.19, and +0.30 on Pythia-70M, GPT-2 medium, and Mistral-7B, respectively. The signature is causally separable from recall: it can be projected out locally without significantly affecting behavioral recall. The paper also distinguishes naturally memorized content from secrets injected by fine-tuning, showing that the two leave distinct representational traces. To remove the signature, the authors introduce "probe-geometry alignment" (PGA), a surgical erasure technique that aligns activations along the probe's readout direction. PGA drives cross-sequence probe accuracy below random chance at every tested scale (e.g., 0.07 on Pythia-70M and 0.45 on Mistral-7B) and remains robust to six adversarial probe variants. An adversarial extension of PGA additionally defeats attackers who re-fit the probe, while keeping five zero-shot capability benchmarks within 2.8 percentage points.
Key takeaway
Research scientists developing or evaluating LLM unlearning techniques should move beyond purely behavioral metrics. An unlearning claim should jointly demonstrate behavioral suppression (e.g., target generation probability < 10^-3) and representational erasure (cross-sequence probing accuracy <= 0.5). Probe-Geometry Alignment (PGA) or its adversarial variant, restricted to deep layers, can surgically push internal memorization traces below chance while preserving model capabilities, which gives stronger support to claims of true data removal and regulatory compliance.
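The joint criterion in this takeaway can be written as a small check. The sketch below is illustrative only: the `UnlearningReport` class, its field names, and `passes_joint_criterion` are hypothetical names introduced here, and the default thresholds are simply the example values quoted above (10^-3 and 0.5).

```python
from dataclasses import dataclass


@dataclass
class UnlearningReport:
    """Hypothetical container for the two measurements of one unlearning run."""
    target_prob: float     # probability the model assigns to generating the target
    probe_accuracy: float  # cross-sequence probe accuracy on held-out sequences


def passes_joint_criterion(report, prob_threshold=1e-3, chance=0.5):
    """Return True only if both the behavioral and representational checks pass."""
    behaviorally_suppressed = report.target_prob < prob_threshold
    representationally_erased = report.probe_accuracy <= chance
    return behaviorally_suppressed and representationally_erased
```

A run with a suppressed target but a probe accuracy still well above 0.5 fails this check, which is the behavioral-representational gap the summary describes.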
Key insights
LLMs retain internal memorization traces that cross-sequence probes can detect, and these traces can be surgically removed with little capability loss (five zero-shot benchmarks stay within 2.8 percentage points).
Principles
- Behavioral unlearning is insufficient for true data removal.
- Memorization signatures are causally separable from recall mechanisms.
- Erasure constraint geometry must match probe readout geometry.
Method
Probe-geometry alignment (PGA) iteratively refits a cross-sequence linear probe, then trains the model to align activations along the probe's live readout direction at each depth, effectively erasing the signature.
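To make the refit-then-erase idea concrete, here is a deliberately simplified stand-in, not PGA itself. It works on fixed activation vectors rather than training a model, uses a mean-difference direction in place of a trained linear probe, and removes the component along that direction by plain projection. All function names are hypothetical. With this particular probe a single round already removes the class-mean gap, so the loop mainly illustrates the "refit, then erase" structure.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def probe_direction(features, labels, tol=1e-9):
    """Unit vector from the class-0 mean to the class-1 mean, or None if the gap is ~0.

    Assumption: this mean-difference direction stands in for a trained probe's readout.
    """
    pos = mean_vector([f for f, y in zip(features, labels) if y == 1])
    neg = mean_vector([f for f, y in zip(features, labels) if y == 0])
    diff = [p - n for p, n in zip(pos, neg)]
    norm = math.sqrt(dot(diff, diff))
    if norm < tol:
        return None
    return [d / norm for d in diff]


def project_out(features, direction):
    """Remove each vector's component along the unit vector `direction`."""
    return [
        [x - dot(f, direction) * d for x, d in zip(f, direction)]
        for f in features
    ]


def refit_and_project(features, labels, max_rounds=3):
    """Alternate between refitting the stand-in probe and erasing its direction."""
    for _ in range(max_rounds):
        direction = probe_direction(features, labels)
        if direction is None:  # nothing left for this probe to read
            break
        features = project_out(features, direction)
    return features
```

The design choice worth noting is that the erased direction is recomputed from the current activations each round, echoing the "live readout direction" in the description above; a real implementation would update model weights and use the paper's own objective.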
In practice
- Implement joint behavioral and representational unlearning criteria.
- Use cross-sequence LOO probes for internal memorization detection.
- Consider PGA for surgical erasure of internal data traces.
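A leave-one-out cross-sequence evaluation, as recommended in the list above, can be sketched as follows. This is a minimal illustration under stated assumptions: each sequence is represented by one pooled activation vector, labels are 1 for memorized and 0 for non-memorized, and a mean-difference classifier stands in for whatever linear probe the paper actually trains. The function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_mean_difference_probe(features, labels):
    """Fit a linear readout (direction, bias) from class means.

    Assumption: a simple stand-in for a trained linear probe.
    """
    pos = mean_vector([f for f, y in zip(features, labels) if y == 1])
    neg = mean_vector([f for f, y in zip(features, labels) if y == 0])
    direction = [p - n for p, n in zip(pos, neg)]
    midpoint = [(p + n) / 2 for p, n in zip(pos, neg)]
    bias = -dot(direction, midpoint)  # decision boundary passes through the midpoint
    return direction, bias


def loo_cross_sequence_accuracy(features, labels):
    """Hold out each sequence in turn, fit on the rest, and score the held-out one.

    Requires at least two sequences per class so every training split has both classes.
    """
    correct = 0
    for held_out in range(len(features)):
        train_x = features[:held_out] + features[held_out + 1:]
        train_y = labels[:held_out] + labels[held_out + 1:]
        direction, bias = fit_mean_difference_probe(train_x, train_y)
        predicted = 1 if dot(direction, features[held_out]) + bias > 0 else 0
        correct += int(predicted == labels[held_out])
    return correct / len(features)
```

Because the probe never sees the held-out sequence during fitting, accuracy above 0.5 indicates a signature that transfers across sequences rather than one tied to a specific string.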
Topics
- Machine Unlearning
- Cross-Sequence Probing
- Memorization Signature
- Probe-Geometry Alignment
- Behavioral-Representational Dissociation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.NE updates on arXiv.org.