LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Data Science & Analytics · Depth: Advanced, quick

Summary

The PropMe framework takes a propensity-aware approach to evaluating memorization in large language models. It separates capability, meaning what adversarial "capability attacks" can force a model to reproduce, from propensity, meaning how often the model leaks training data under ordinary use. Using a metric transformation and the SimpleTrace pipeline built on infini-gram, the framework deterministically attributes model generations to training corpora such as Common Pile and Dynaword. In evaluations of the Comma and DFM Decoder models, the study found a consistent gap: prefix attacks elicited substantially stronger memorization signals than generic prompts, yet overall propensity scores remained low. In other words, the models can reveal training data when directly elicited but rarely do so in non-adversarial settings. DFM Decoder, which was continually pre-trained from Comma, also showed reduced memorization of Common Pile, suggesting that emphasizing partially different data in later training can decrease memorization capability. The authors encourage comprehensive memorization audits that report both worst-case extractability and ordinary leakage propensity.

Key takeaway

If you are an AI security engineer evaluating LLM data leakage, differentiate between what a model can be made to leak and what it tends to leak under normal use. Your audits should report both worst-case extractability, which prefix attacks often reveal, and ordinary leakage propensity. Reporting both gives a more accurate picture of privacy risk than either number alone. When designing continual pre-training, consider emphasizing partially different data: the study suggests this may reduce memorization of the earlier data, as DFM Decoder's reduced memorization of Common Pile indicates.

Key insights

LLMs can be forced to leak training data, but rarely do so under ordinary, non-adversarial use.

Method

The PropMe framework contrasts prefix-based capability attacks with non-adversarial evaluations. It uses a metric transformation and SimpleTrace (built on infini-gram) to attribute generations to the training corpus and to compute verbatim, near-verbatim, and propensity metrics.
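
The contrast between the two evaluation settings can be illustrated with a toy sketch. It is not the PropMe metric transformation or the SimpleTrace pipeline, and it does not call infini-gram: an in-memory set of n-grams stands in for the corpus index, whitespace splitting stands in for tokenization, and every name here (`build_index`, `verbatim_coverage`, `mean_coverage`, the sample strings) is hypothetical. It only shows the general idea of scoring verbatim overlap under prefix attacks and under generic prompts, then reporting both numbers and the gap between them.

```python
from __future__ import annotations

from typing import Iterable


def build_index(corpus_docs: Iterable[str], n: int) -> set[tuple[str, ...]]:
    """Collect every contiguous n-gram in the corpus (toy stand-in for a real index)."""
    index: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        tokens = doc.split()  # assumption: whitespace tokenization
        for i in range(len(tokens) - n + 1):
            index.add(tuple(tokens[i:i + n]))
    return index


def verbatim_coverage(generation: str, index: set[tuple[str, ...]], n: int) -> float:
    """Fraction of generated tokens that fall inside an n-gram present in the corpus."""
    tokens = generation.split()
    if len(tokens) < n:
        return 0.0
    covered = [False] * len(tokens)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in index:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(tokens)


def mean_coverage(generations: list[str], index: set[tuple[str, ...]], n: int) -> float:
    """Average verbatim coverage over a set of generations."""
    if not generations:
        return 0.0
    return sum(verbatim_coverage(g, index, n) for g in generations) / len(generations)


# Hypothetical data: one "training" document and two sets of model outputs.
N = 4
index = build_index(["the quick brown fox jumps over the lazy dog"], N)

# Outputs elicited by feeding the model a training prefix (capability setting).
prefix_attack_outputs = ["brown fox jumps over the lazy dog"]
# Outputs from ordinary, non-adversarial prompts (propensity setting).
generic_outputs = [
    "a fox jumps over the fence today",
    "models rarely repeat training text verbatim",
]

capability = mean_coverage(prefix_attack_outputs, index, N)
propensity = mean_coverage(generic_outputs, index, N)
gap = capability - propensity

print(f"capability={capability:.3f} propensity={propensity:.3f} gap={gap:.3f}")
```

An audit in this spirit would report `capability` and `propensity` side by side rather than collapsing them into a single score. A real pipeline would replace the in-memory set with an index over the full training corpus and would add a near-verbatim measure, which this sketch omits.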

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.