Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

2026-04-09 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A new study formalizes fact memorization in large language models (LLMs) from an information-theoretic perspective. It shows that fact accuracy falls below what the model's capacity allows when the training data contains more information than the model can store, especially when fact frequencies are skewed. The authors propose data selection schemes, based solely on training loss, that limit the number of facts and flatten their frequency distribution. On semi-synthetic datasets, this selection raises fact accuracy to the capacity limit. When used to pretrain a GPT2-Small model (110M parameters) on an annotated Wikipedia corpus, it enabled the model to memorize 1.3X more entity facts than standard training, matching a 1.3B-parameter model trained on the full dataset.

Key takeaway

AI engineers optimizing LLMs for knowledge-intensive tasks should consider pruning training data and flattening the fact frequency distribution. In the reported experiments, this let GPT2-Small (110M parameters) match the entity-fact recall of a 1.3B-parameter model, roughly 10X larger, trained on the full dataset. The result suggests that factual memorization can be improved without increasing model size.

Key insights

Training data pruning and frequency flattening significantly improve LLM factual memorization within capacity limits.

Principles

Fact accuracy is capacity-limited.
Skewed fact frequency reduces memorization.
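
A toy illustration of the second principle, not taken from the study: suppose a fact is only memorized if it appears at least a fixed number of times, and compare a flat frequency distribution with a Zipf-like one under the same total budget of mentions. The function name, the exposure threshold, and the Zipf form are all assumptions made for this sketch.

```python
def facts_with_enough_exposure(num_facts, total_mentions, min_exposures, zipf_exponent):
    """Count facts whose expected number of mentions reaches min_exposures.

    Assumption (illustrative only): the fact at frequency rank r is mentioned
    in proportion to 1 / r**zipf_exponent. An exponent of 0 gives a flat
    distribution; larger exponents give a more skewed one.
    """
    weights = [1.0 / (rank ** zipf_exponent) for rank in range(1, num_facts + 1)]
    norm = sum(weights)
    # Expected mentions per fact under a fixed total budget.
    expected = [total_mentions * w / norm for w in weights]
    return sum(1 for e in expected if e >= min_exposures)


flat = facts_with_enough_exposure(1000, 10_000, 10, zipf_exponent=0.0)
skewed = facts_with_enough_exposure(1000, 10_000, 10, zipf_exponent=1.0)
print(flat, skewed)
```

Under these assumptions, the flat distribution brings every fact up to the threshold, while the skewed one spends most of the budget on a small head of frequent facts and leaves the tail under-exposed.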

Method

Data selection based on training loss limits fact count and flattens frequency distribution to optimize LLM factual memorization.
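
The summary does not spell out the selection rule, so the following is only a minimal sketch of one way loss-based selection could look, not the authors' algorithm. It assumes that very high loss marks facts the model is unlikely to fit (dropping them limits the fact count) and that very low loss marks over-represented facts (thinning them flattens the distribution). The function name, thresholds, and default values are hypothetical.

```python
import random


def select_by_loss(losses, drop_hardest=0.2, easy_threshold=0.5,
                   easy_keep_prob=0.25, seed=0):
    """Return sorted indices of training examples to keep.

    losses: per-example training losses from a reference pass.
    drop_hardest: fraction of highest-loss examples to discard (assumption:
        these carry facts beyond the model's capacity).
    easy_threshold, easy_keep_prob: examples with loss below the threshold
        are kept only with this probability (assumption: low loss signals
        frequently repeated facts).
    """
    rng = random.Random(seed)
    n = len(losses)
    n_drop = int(n * drop_hardest)

    # Limit the fact count: rank by loss and cut the highest-loss tail.
    ranked = sorted(range(n), key=lambda i: losses[i])
    survivors = ranked[: n - n_drop]

    # Flatten the distribution: randomly thin out low-loss examples.
    kept = []
    for i in survivors:
        if losses[i] < easy_threshold and rng.random() >= easy_keep_prob:
            continue
        kept.append(i)
    return sorted(kept)


example_losses = [float(i) for i in range(10)]
print(select_by_loss(example_losses, easy_keep_prob=1.0))
```

In a real pipeline the losses would come from the model being trained or a reference checkpoint, and the thresholds would need tuning against held-out fact accuracy.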

In practice

Prune training data using loss scores.
Flatten fact frequency distributions.
Reported gain: GPT2-Small memorized 1.3X more entity facts than with standard training.

Topics

Large Language Models
Fact Memorization
Training Data Pruning
Information Theory
Data Selection Schemes

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.