Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance

Source: cs.NE updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy) · Depth: Expert, extended

Summary

This research characterizes and surgically removes internal traces of memorization in large language models (LLMs) that persist after behavioral unlearning. The study introduces a "leave-one-out cross-sequence" (LOO) probe that detects a memorization signature generalizing to held-out sequences, and it reports consistent positive gaps of +0.32, +0.19, and +0.30 on Pythia-70M, GPT-2 medium, and Mistral-7B, respectively. The signature is causally separable from recall: it can be projected out locally without significantly affecting behavioral recall. The paper also shows that naturally memorized content and secrets injected by fine-tuning leave distinct representational traces. To erase the signature, the authors introduce "probe-geometry alignment" (PGA), a surgical erasure technique that aligns activations along the probe's readout direction. PGA drives the cross-sequence probe below random chance at all tested scales (e.g., to 0.07 on Pythia-70M and 0.45 on Mistral-7B) and remains robust to six adversarial probe variants. An adversarial extension of PGA further defeats re-fitting attackers while keeping five zero-shot capability benchmarks within 2.8 percentage points.
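
The leave-one-out cross-sequence idea can be illustrated with a small sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the probe is a simple difference-of-means linear classifier, each sequence is represented by an array of per-token activations, and the names `fit_mean_diff_probe` and `loo_cross_sequence_accuracy` are hypothetical. The point it shows is only the evaluation protocol: the probe is fitted on every sequence except one and scored on the sequence it never saw.

```python
import numpy as np

def fit_mean_diff_probe(X, y):
    """Fit a difference-of-means linear probe (an assumed, simple probe type).

    Returns a unit readout direction w and a scalar threshold b.
    """
    mu_pos = X[y == 1].mean(axis=0)
    mu_neg = X[y == 0].mean(axis=0)
    w = mu_pos - mu_neg
    w = w / (np.linalg.norm(w) + 1e-12)
    # Threshold halfway between the two class means along w.
    b = 0.5 * (mu_pos + mu_neg) @ w
    return w, b

def loo_cross_sequence_accuracy(acts, labels):
    """Leave-one-sequence-out probe accuracy.

    acts:   list of (n_tokens_i, d) arrays, one per sequence (hypothetical layout)
    labels: 0/1 per sequence, 1 = memorized, 0 = not memorized
    """
    labels = np.asarray(labels)
    correct = 0
    for held in range(len(acts)):
        train = [i for i in range(len(acts)) if i != held]
        X = np.concatenate([acts[i] for i in train])
        y = np.concatenate([np.full(len(acts[i]), labels[i]) for i in train])
        w, b = fit_mean_diff_probe(X, y)
        # Pool the held-out sequence's tokens, then read out along w.
        score = acts[held].mean(axis=0) @ w
        correct += int(int(score > b) == labels[held])
    return correct / len(acts)
```

On real activations, `acts` would hold hidden states captured at one layer. Accuracy well above 0.5 on held-out sequences would then suggest a signature shared across sequences rather than one specific to any single memorized string.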

Key takeaway

If you develop or evaluate LLM unlearning techniques, move beyond purely behavioral metrics. Your unlearning claims should jointly report behavioral suppression (e.g., target generation probability < 10^-3) and representational erasure (cross-sequence probing accuracy <= 0.5). Apply Probe-Geometry Alignment (PGA) or its adversarial variant, restricted to deep layers, to remove internal memorization traces below chance while preserving model capabilities; this supports stronger claims of data removal and regulatory compliance.
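
A joint check of this kind might be encoded as follows. This is a minimal sketch: the two thresholds come from the takeaway above, while the `UnlearningReport` container and the `passes_joint_criterion` function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass

@dataclass
class UnlearningReport:
    # Hypothetical container for the two measurements discussed above.
    target_prob: float     # probability the model generates the forgotten target
    probe_accuracy: float  # cross-sequence probe accuracy after unlearning

def passes_joint_criterion(report, prob_threshold=1e-3, probe_threshold=0.5):
    """Require BOTH behavioral suppression and representational erasure."""
    behavioral = report.target_prob < prob_threshold
    representational = report.probe_accuracy <= probe_threshold
    return behavioral and representational
```

For example, a model with target probability 1e-4 but probe accuracy 0.82 fails the check: it no longer emits the target, yet a linear probe can still read the memorization signature from its activations.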

Key insights

LLMs retain internal memorization traces detectable by cross-sequence probes, and these traces can be surgically removed with little capability loss (five zero-shot benchmarks stay within 2.8 percentage points).

Method

Probe-geometry alignment (PGA) iteratively refits a cross-sequence linear probe, then trains the model to align activations along the probe's live readout direction at each depth, effectively erasing the signature.
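
The refit-then-align loop can be mimicked on raw activation matrices. The sketch below is a toy stand-in, not the paper's procedure: it edits activations directly instead of training model weights, which makes it closer to iterative linear-projection erasure than to PGA itself. It also uses a difference-of-means probe and a fixed `step` size, both of which are assumptions. All function names are hypothetical. Each round refits the probe on the current activations (the "live" direction) and then moves every activation part of the way toward a shared coordinate along that direction.

```python
import numpy as np

def fit_probe_direction(X, y):
    """Unit difference-of-means direction (assumed probe type)."""
    w = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

def align_along_direction(X, w, step):
    """Move each row's coordinate along w a fraction `step` toward the pooled mean."""
    coords = X @ w
    return X + step * np.outer(coords.mean() - coords, w)

def toy_alignment_loop(X, y, rounds=5, step=0.5, tol=1e-8):
    """Refit the probe each round, then align activations along its direction.

    Returns the edited activations and the class-mean gap measured along the
    refitted direction before each update.
    """
    X = X.copy()
    gaps = []
    for _ in range(rounds):
        w = fit_probe_direction(X, y)  # refit on the current activations
        gap = float(abs((X[y == 1] @ w).mean() - (X[y == 0] @ w).mean()))
        gaps.append(gap)
        if gap < tol:
            break
        X = align_along_direction(X, w, step)
    return X, gaps

def align_deep_layers(acts_by_layer, y, first_layer, rounds=5, step=0.5):
    """Apply the toy loop only to layers at or beyond `first_layer`."""
    out = []
    for depth, X in enumerate(acts_by_layer):
        if depth >= first_layer:
            X, _ = toy_alignment_loop(X, y, rounds, step)
        out.append(X)
    return out
```

In this toy, the gap along the probe direction shrinks by a factor of `1 - step` each round, so it approaches zero. It does not reproduce the below-chance probe accuracies reported in the summary, which come from training the model itself.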

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.NE updates on arXiv.org.