Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

2026-04-13 · Source: Apple Machine Learning Research · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

A paper accepted at the ICLR 2026 Workshop on Navigating and Addressing Data Problems for Foundation Models formalizes fact memorization in large language models (LLMs) from an information-theoretic perspective. It studies how the training data distribution affects fact accuracy and finds that accuracy falls short of what the model could achieve when the information in the training data exceeds model capacity, especially when fact frequencies follow a skewed distribution such as a power law. The authors propose data selection schemes based on training loss that limit the number of facts and flatten their frequency distribution. On semi-synthetic datasets with high-entropy facts, this selection raises accuracy to the capacity limit. When pretraining a GPT2-Small model (110M parameters) on an annotated Wikipedia corpus, the selection method let it memorize 1.3× more entity facts than standard training, matching the performance of a 1.3B-parameter model trained on the full dataset.

Key takeaway

AI engineers and research scientists working on LLM factual recall and hallucination mitigation should consider data selection schemes based on training loss. In the reported experiments, pruning the training data this way let GPT2-Small (110M parameters) memorize 1.3× more entity facts and match a 1.3B-parameter model trained on the full dataset, which suggests smaller models can make better use of their capacity when given less, but better-balanced, data.

Key insights

Fact memorization in LLMs is bounded by model capacity, and skewed fact frequency distributions in the training data push accuracy further below that bound.

Principles

Fact accuracy is suboptimal if training data information exceeds model capacity.
Skewed fact frequency distributions worsen memorization.
Data selection can improve fact memorization.
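
The first principle can be made concrete with a simple counting argument. The sketch below is a toy illustration, not the paper's formalism: it assumes every fact costs the same number of bits to store and that the model has a fixed capacity in bits. The names `bits_per_fact` and `capacity_bits`, and the numbers used with them, are hypothetical.

```python
def facts_that_fit(bits_per_fact, capacity_bits):
    """Largest number of equally sized facts that fit in the capacity budget."""
    if bits_per_fact <= 0:
        raise ValueError("bits_per_fact must be positive")
    return int(capacity_bits // bits_per_fact)


def max_memorizable_fraction(num_facts, bits_per_fact, capacity_bits):
    """Upper bound on the fraction of facts a model could store.

    Assumption (hypothetical, for illustration): storing a fraction f of
    num_facts facts costs f * num_facts * bits_per_fact bits, so f cannot
    exceed capacity_bits / (num_facts * bits_per_fact).
    """
    if num_facts <= 0 or bits_per_fact <= 0:
        raise ValueError("num_facts and bits_per_fact must be positive")
    required_bits = num_facts * bits_per_fact
    return min(1.0, capacity_bits / required_bits)


# Hypothetical numbers: a 1,000,000-bit budget and 50 bits per fact.
budget = facts_that_fit(bits_per_fact=50, capacity_bits=1_000_000)
over_capacity = max_memorizable_fraction(40_000, 50, 1_000_000)
within_capacity = max_memorizable_fraction(10_000, 50, 1_000_000)
```

Under these assumptions, a corpus holding twice as many facts as fit in the budget caps the reachable fraction at one half, no matter how training proceeds. That is the motivation for limiting the fact count before training.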

Method

Data selection schemes based on training loss limit fact count and flatten frequency distributions to improve LLM fact memorization.
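
The two goals named above, limiting the fact count and flattening the frequency distribution, can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' algorithm: it assumes each training example carries a hypothetical `fact_id` label and a per-example loss from a reference pass, and it treats the ranking direction (`prefer_low_loss`) as a free choice, since the summary does not say which facts the paper's scheme keeps.

```python
from collections import defaultdict


def select_examples(examples, fact_budget, max_per_fact, prefer_low_loss=True):
    """Return sorted indices of the examples to keep.

    examples: list of (fact_id, loss) tuples (hypothetical format).
    fact_budget: maximum number of distinct facts to keep (limits fact count).
    max_per_fact: cap on examples kept per fact (flattens the frequencies).
    prefer_low_loss: ranking direction; an assumption, not taken from the paper.
    """
    # Group example indices and losses by the fact they express.
    by_fact = defaultdict(list)
    for index, (fact_id, loss) in enumerate(examples):
        by_fact[fact_id].append((loss, index))

    # Score each fact by its mean training loss.
    mean_loss = {
        fact_id: sum(loss for loss, _ in items) / len(items)
        for fact_id, items in by_fact.items()
    }

    # Keep at most fact_budget facts, ranked by mean loss.
    ranked = sorted(mean_loss, key=mean_loss.get, reverse=not prefer_low_loss)
    kept_facts = ranked[:fact_budget]

    # Cap the number of examples per kept fact (first occurrences win).
    kept = []
    for fact_id in kept_facts:
        kept.extend(index for _, index in by_fact[fact_id][:max_per_fact])
    return sorted(kept)


# Toy corpus: fact "a" is frequent, "b" is rarer, "c" appears once.
toy = [("a", 0.1)] * 5 + [("b", 0.5)] * 2 + [("c", 2.0)]
kept_indices = select_examples(toy, fact_budget=2, max_per_fact=2)
```

In the toy corpus, the frequent fact "a" is cut from five examples to two, so the kept facts end up with equal frequency. A real pipeline would need a way to attribute loss to facts without explicit labels, which this sketch does not attempt.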

In practice

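Use loss-based data selection to raise LLM fact accuracy when the training data holds more facts than the model can store.
Pretrain smaller models on the pruned data to approach the factual performance of larger models trained on the full dataset.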

Topics

Fact Memorization
Large Language Models
Training Data Pruning
Information Theory
Data Selection Schemes

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.