Output Vector Editing for Memorization Mitigation in Large Language Models
Summary
A new method, output vector editing, addresses large language model memorization risks by minimally modifying MLP neuron output vectors instead of zeroing activations. This technique, evaluated on four models from 360M to 7B parameters (SmolLM-360M, OLMo-1B, OLMo-7B, Llama2-7B), achieved up to 87.9% suppression on 6,831 memorized sequences from OLMo-7B. This represents a 2.7x improvement over zero ablation on the same located neurons. Four distinct edit modes offer a spectrum from aggressive suppression to minimal redirection; the "Next-best" mode achieved 81.5% suppression with no catastrophic locality failures. Approximately 14% of memorized sequences resisted MLP-only editing, indicating attention-layer intervention as a complementary fallback.
Key takeaway
For AI Security Engineers concerned with LLM privacy and copyright risks, output vector editing offers a targeted mitigation strategy. Prioritize the "Next-best" edit mode (k=5), which reached 81.5% suppression on OLMo-7B with zero catastrophic locality failures. For broader coverage, consider an ensemble of edit modes. Approximately 14% of memorized sequences may still require complementary attention-layer interventions, especially for copy-style continuations.
Key insights
Output vector editing surgically mitigates LLM memorization by redirecting MLP neuron contributions, preserving other encoded features.
Principles
- MLP activations and output vectors have separable roles.
- Output vector editing is less destructive than activation zeroing.
- A success-locality trade-off exists across editing modes.
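The first two principles can be illustrated with a toy sketch. It assumes the common view that a single MLP neuron adds `activation * output_vector` to the residual stream, and it uses a simple projection as a stand-in for "editing the output vector". The projection, the three-dimensional vectors, and the name `memorized_direction` are illustrative assumptions, not details taken from the article.

```python
# Toy illustration only: contrasts zeroing an activation with editing the
# neuron's output vector. The projection used here is an assumed stand-in
# for an output vector edit, not the article's exact procedure.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contribution(activation, output_vector):
    """What one neuron adds to the residual stream."""
    return [activation * x for x in output_vector]

def remove_direction(output_vector, direction):
    """Project out `direction` (assumed unit-norm) from the output vector."""
    coeff = dot(output_vector, direction)
    return [x - coeff * d for x, d in zip(output_vector, direction)]

activation = 2.0
output_vector = [3.0, 1.0, 0.5]
memorized_direction = [1.0, 0.0, 0.0]  # hypothetical unit vector

# Zero ablation: the neuron's whole contribution disappears.
zero_ablated = contribution(0.0, output_vector)

# Output vector edit: only the component along the assumed direction is
# removed; the neuron still fires and still writes its other components.
edited = contribution(activation, remove_direction(output_vector, memorized_direction))

print(zero_ablated)  # [0.0, 0.0, 0.0]
print(edited)        # [0.0, 2.0, 1.0]
```

In this toy case the edited neuron keeps two of its three components, which is the sense in which an output vector edit can be less destructive than zeroing the activation.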
Method
The method locates the MLP neurons responsible for memorized continuations and, without gradient computation, applies a rank-one weight update to their output vectors that introduces a distractor token.
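One way such a gradient-free edit could look is sketched below. It is a minimal sketch under stated assumptions, not the article's algorithm: it assumes the neuron's activation is positive, that token logits are read off with unembedding vectors (`u_target`, `u_distractor`), and that the edit is the smallest-norm change making the distractor outscore the memorized token by a chosen `margin`. The function name `min_norm_edit` and the `margin` parameter are hypothetical.

```python
# Hypothetical sketch: closed-form, minimum-norm edit of one neuron's output
# vector so that a distractor token outscores the memorized target token.
# Assumes a positive activation; all names are illustrative.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def min_norm_edit(v, u_target, u_distractor, margin=1.0):
    """Return the edited output vector v' closest to v (in L2 norm) such that
    dot(u_distractor, v') - dot(u_target, v') >= margin."""
    g = [d - t for d, t in zip(u_distractor, u_target)]
    g_norm_sq = dot(g, g)
    if g_norm_sq == 0.0:
        raise ValueError("target and distractor unembeddings are identical")
    gap = dot(g, v)
    if gap >= margin:
        return list(v)  # constraint already satisfied, no edit needed
    scale = (margin - gap) / g_norm_sq
    return [x + scale * gi for x, gi in zip(v, g)]

# W_out is stored neuron-major here: W_out[i] is neuron i's output vector.
W_out = [[1.0, 0.0], [0.5, 0.5]]
u_target = [1.0, 0.0]      # assumed unembedding of the memorized token
u_distractor = [0.0, 1.0]  # assumed unembedding of the distractor token

located = 0  # index of the located neuron (assumed known)
W_edited = [list(row) for row in W_out]
W_edited[located] = min_norm_edit(W_out[located], u_target, u_distractor)

print(W_edited[located])  # [0.0, 1.0]
```

Because only one neuron's row changes, the difference between `W_edited` and `W_out` has rank one, and the edit is computed in closed form rather than by gradient descent.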
In practice
- Use "Next-best" edit mode (k=5) for 81.5% suppression with minimal locality cost.
- Combine edit modes for 96.5% coverage of memorized sequences.
- Consider attention-layer ablation for ~14% of MLP-resistant sequences.
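The ensemble recommendation amounts to taking the union of sequences that any single edit mode suppresses, then routing what remains to a fallback. The sketch below shows that bookkeeping on made-up data; the sequence IDs, the mode names other than "next_best", and the function `ensemble_coverage` are hypothetical and do not reproduce the reported figures.

```python
# Hypothetical bookkeeping for an ensemble of edit modes. The data below is
# invented for illustration and does not reproduce the reported results.

def ensemble_coverage(suppressed_by_mode, all_ids):
    """Return (fraction of sequences suppressed by at least one mode,
    sorted list of sequence IDs that resisted every mode)."""
    all_ids = set(all_ids)
    if not all_ids:
        raise ValueError("no sequences to evaluate")
    covered = set().union(*suppressed_by_mode.values()) & all_ids
    resistant = sorted(all_ids - covered)
    return len(covered) / len(all_ids), resistant

all_ids = range(10)
suppressed_by_mode = {
    "mode_a": {0, 1, 2, 3, 4, 5},  # placeholder mode name
    "mode_b": {4, 5, 6, 7},        # placeholder mode name
    "next_best": {0, 2, 8},
}

coverage, resistant = ensemble_coverage(suppressed_by_mode, all_ids)
print(coverage)   # 0.9
print(resistant)  # [9]
```

Sequences left in `resistant` are the candidates for a complementary intervention such as attention-layer ablation.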
Topics
- Output Vector Editing
- LLM Memorization
- MLP Neurons
- Model Editing
- Privacy Risks
- Copyright Infringement
- OLMo-7B
Best for: Research Scientist, CTO, Director of AI/ML, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.