Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
Summary
A new study formalizes fact memorization in large language models (LLMs) from an information-theoretic perspective. It shows that fact accuracy is suboptimal when the amount of information in the training data exceeds model capacity, especially when fact frequencies are skewed. The research proposes data selection schemes, based solely on training loss, that limit the number of facts and flatten their frequency distribution. On semi-synthetic datasets, this method boosts fact accuracy to the capacity limit. When applied to pretraining a GPT2-Small model (110M parameters) on an annotated Wikipedia corpus, the selection method enabled it to memorize 1.3X more entity facts than standard training, achieving performance comparable to a 1.3B parameter model trained on the full dataset.
Key takeaway
For AI engineers optimizing LLM performance on knowledge-intensive tasks, consider pruning training data and flattening the fact frequency distribution. In the reported experiment, this let GPT2-Small (110M parameters) match the factual recall of a 1.3B parameter model, more than 10X larger, trained on the full dataset. This suggests factual memorization can be improved without increasing model size.
Key insights
Training data pruning and frequency flattening let an LLM memorize more facts, up to its capacity limit.
Principles
- Fact accuracy is capacity-limited.
- Skewed fact frequency reduces memorization.
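To make the second principle concrete, here is a toy sketch of how a fixed training budget is spread across facts under a flat versus a Zipf-like frequency distribution. Everything in it is an assumption for illustration: the function names, the power-law form of the skew, and especially the idea that a fact needs some minimum number of exposures to be memorized. It is not the study's formal model.

```python
def expected_exposures(num_facts, budget, skew):
    """Split `budget` training examples across facts ranked 1..num_facts.

    Fact at rank r gets a share proportional to 1 / r**skew
    (skew=0 is a flat distribution, skew=1 is Zipf-like).
    """
    weights = [1.0 / (rank ** skew) for rank in range(1, num_facts + 1)]
    total = sum(weights)
    return [budget * w / total for w in weights]


def facts_meeting_threshold(exposures, min_exposures):
    """Count facts seen at least `min_exposures` times (hypothetical cutoff)."""
    return sum(1 for e in exposures if e >= min_exposures)


flat = expected_exposures(num_facts=1000, budget=10_000, skew=0.0)
skewed = expected_exposures(num_facts=1000, budget=10_000, skew=1.0)

flat_count = facts_meeting_threshold(flat, min_exposures=5)
skewed_count = facts_meeting_threshold(skewed, min_exposures=5)
print(flat_count, skewed_count)
```

With the same budget, the skewed distribution spends most examples on a few frequent facts, so fewer facts reach the assumed exposure cutoff.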
Method
Data selection based on training loss limits fact count and flattens frequency distribution to optimize LLM factual memorization.
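As a rough illustration of loss-based selection, the sketch below keeps only the training examples whose per-example loss falls inside a band. The band rule, the function name `select_by_loss`, and the threshold values are assumptions made for this example; the summary does not specify the study's exact selection criterion.

```python
def select_by_loss(losses, low, high):
    """Return indices of examples whose training loss lies in [low, high].

    Assumed intuition (not confirmed by the source): very low loss suggests
    an already well-learned, frequent fact, and very high loss suggests a
    fact the model is unlikely to fit within its capacity.
    """
    return [i for i, loss in enumerate(losses) if low <= loss <= high]


# Hypothetical per-example losses from an earlier training pass.
example_losses = [0.01, 0.5, 1.2, 4.0, 0.02, 2.0]
kept = select_by_loss(example_losses, low=0.1, high=3.0)
print(kept)
```

In a real pipeline the losses would come from the model being trained, and the band would need tuning against the model's capacity.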
In practice
- Prune training data using loss scores.
- Flatten fact frequency distributions.
- Reported result: GPT2-Small memorizes 1.3X more entity facts than with standard training.
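One simple way to picture frequency flattening is to cap how many training examples each fact may contribute. The sketch below assumes every example carries an explicit fact label, which is a simplification for illustration: the study describes selection based solely on training loss, and `cap_fact_frequency` is a hypothetical helper, not the authors' method.

```python
from collections import Counter


def cap_fact_frequency(fact_ids, max_per_fact):
    """Return indices of examples to keep so no fact appears more than
    `max_per_fact` times. Earlier occurrences are kept first."""
    seen = Counter()
    kept = []
    for idx, fact in enumerate(fact_ids):
        if seen[fact] < max_per_fact:
            seen[fact] += 1
            kept.append(idx)
    return kept


# Hypothetical fact labels, one per training example.
labels = ["a", "a", "a", "b", "a", "c", "b"]
kept_indices = cap_fact_frequency(labels, max_per_fact=2)
print(kept_indices)
```

Capping trims the head of the distribution while leaving rare facts untouched, which flattens the overall frequency profile.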
Topics
- Large Language Models
- Fact Memorization
- Training Data Pruning
- Information Theory
- Data Selection Schemes
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.