Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance

2026-05-05 · Source: cs.NE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

This research characterizes and surgically removes internal traces of memorization in large language models (LLMs) that persist after behavioral unlearning. The study introduces a "leave-one-out cross-sequence" (LOO) probe that detects a generalizable memorization signature on held-out sequences, finding consistent positive gaps of +0.32, +0.19, and +0.30 on Pythia-70M, GPT-2 medium, and Mistral-7B, respectively. The signature is causally separable from recall: it can be projected out locally without significantly affecting behavioral recall. The paper also distinguishes naturally memorized content from fine-tuning-injected secrets, showing that the two leave distinct representational traces. To remove the signature, the paper introduces "probe-geometry alignment" (PGA), a surgical erasure technique that aligns activations along the probe's readout direction. PGA drives the cross-sequence probe below random chance at all tested scales (e.g., to 0.07 on Pythia-70M and 0.45 on Mistral-7B) and remains robust to six adversarial probe variants. An adversarial extension of PGA further defeats re-fitting attackers while preserving five zero-shot capability benchmarks within 2.8 percentage points.

Key takeaway

If you develop or evaluate LLM unlearning techniques, move beyond purely behavioral metrics. Your unlearning claims should combine behavioral suppression (e.g., target generation probability <10^-3) with representational erasure (cross-sequence probing accuracy <=0.5). Implement Probe-Geometry Alignment (PGA) or its adversarial variant, restricted to deep layers, to surgically remove internal memorization traces below chance while preserving model capabilities, which supports claims of true data removal and regulatory compliance.
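
As a minimal sketch of the joint criterion described above, the check below combines the two thresholds. The `UnlearningReport` container and the `passes_joint_criterion` function are hypothetical names introduced for illustration; only the thresholds (<10^-3 and <=0.5) come from the takeaway.

```python
from dataclasses import dataclass


@dataclass
class UnlearningReport:
    # Hypothetical container; field names are assumptions of this sketch.
    target_prob: float     # probability the model assigns to generating the target
    probe_accuracy: float  # cross-sequence probe accuracy on held-out sequences


def passes_joint_criterion(report, prob_threshold=1e-3, probe_threshold=0.5):
    """Return True only if both the behavioral and representational tests pass."""
    behaviorally_suppressed = report.target_prob < prob_threshold
    representationally_erased = report.probe_accuracy <= probe_threshold
    return behaviorally_suppressed and representationally_erased
```

A model that no longer emits the target but whose probe accuracy stays well above 0.5 fails this check, which is the behavioral-only case the takeaway warns about.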

Key insights

LLMs retain internal memorization traces detectable by cross-sequence probes, which can be surgically removed with minimal capability loss.

Principles

Behavioral unlearning is insufficient for true data removal.
Memorization signatures are causally separable from recall mechanisms.
Erasure constraint geometry must match probe readout geometry.
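
The summary states that the signature can be projected out locally. One way to picture this, assuming the probe reads out along a single linear direction `w` (an assumption of this sketch, not a detail confirmed by the source), is to remove the component of an activation along `w` and confirm that the probe's readout on the result is zero. The function names are illustrative.

```python
def readout(h, w):
    """Linear probe readout: dot product of activation h with direction w."""
    return sum(a * b for a, b in zip(h, w))


def project_out(h, w):
    """Return h with its component along w removed (w need not be unit length)."""
    norm_sq = sum(x * x for x in w)
    if norm_sq == 0.0:
        return list(h)  # zero direction: nothing to remove
    coeff = readout(h, w) / norm_sq
    return [a - coeff * b for a, b in zip(h, w)]


h = [3.0, 4.0, 1.0]   # toy activation vector
w = [1.0, 1.0, 0.0]   # toy probe direction
h_erased = project_out(h, w)
```

Components of `h` orthogonal to `w` are left untouched, which is why an erasure that targets a different direction from the one the probe actually reads would leave the signature in place.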

Method

Probe-geometry alignment (PGA) iteratively refits a cross-sequence linear probe, then trains the model to align activations along the probe's live readout direction at each depth, effectively erasing the signature.
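
The refit-then-align loop described above can be illustrated with a toy, which is not the paper's training procedure. Assumptions of this sketch: the probe is a difference-of-class-means direction, activations at one layer are updated directly instead of training model weights, and all function names are hypothetical. Each round refits the probe and then shifts the memorized activations along the probe's current direction, so the gap the probe reads shrinks.

```python
import math


def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_probe(memorized, control):
    """Toy linear probe: unit direction and size of the class-mean difference."""
    diff = [m - c for m, c in zip(mean_vec(memorized), mean_vec(control))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        return [0.0] * len(diff), 0.0
    return [d / norm for d in diff], norm


def align_step(memorized, control, lr=0.5):
    """Refit the probe, then shift memorized activations along its direction."""
    direction, gap = fit_probe(memorized, control)
    shifted = [[x - lr * gap * d for x, d in zip(row, direction)]
               for row in memorized]
    return shifted, gap


def toy_alignment(memorized, control, rounds=10, lr=0.5):
    """Run several refit-then-align rounds; return final activations and gaps."""
    gaps = []
    for _ in range(rounds):
        memorized, gap = align_step(memorized, control, lr)
        gaps.append(gap)  # gap measured before this round's shift
    return memorized, gaps
```

In this toy the gap along the probe direction decays geometrically by a factor of `1 - lr` per round. It does not model the below-chance result or the adversarial extension reported in the summary.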

In practice

Implement joint behavioral and representational unlearning criteria.
Use cross-sequence LOO probes for internal memorization detection.
Consider PGA for surgical erasure of internal data traces.
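
As a rough sketch of the cross-sequence LOO probing recommended above: hold out one sequence at a time, fit a linear probe on activations from the remaining sequences, and score it on the held-out one. The data layout (a dict of sequence id to labeled activation vectors), the mean-difference probe, and the function names are all assumptions made for illustration; the source does not specify the probe's form.

```python
def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_mean_difference_probe(examples):
    """examples: list of (activation, label); label 1 = memorized, 0 = control.

    Assumes both classes are present in the training examples.
    """
    pos = [h for h, y in examples if y == 1]
    neg = [h for h, y in examples if y == 0]
    mu_pos, mu_neg = mean_vec(pos), mean_vec(neg)
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    bias = -dot(w, midpoint)  # decision boundary halfway between class means
    return w, bias


def loo_cross_sequence_accuracy(sequences):
    """sequences: dict mapping sequence id -> list of (activation, label)."""
    correct = total = 0
    for held_out, test_examples in sequences.items():
        # Train only on activations from the other sequences.
        train = [ex for sid, exs in sequences.items() if sid != held_out
                 for ex in exs]
        w, bias = fit_mean_difference_probe(train)
        for h, y in test_examples:
            pred = 1 if dot(w, h) + bias > 0 else 0
            correct += int(pred == y)
            total += 1
    return correct / total
```

Because the probe never sees the held-out sequence during fitting, accuracy above 0.5 on balanced data suggests a signature shared across sequences, not sequence-specific content.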

Topics

Machine Unlearning
Cross-Sequence Probing
Memorization Signature
Probe-Geometry Alignment
Behavioral-Representational Dissociation

Code references

Rupawheatly/MLDU

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.NE updates on arXiv.org.