Data Augmentations for Data-Constrained Language Model Pretraining
Summary
Researchers investigate data augmentation as a regularizer against overfitting in autoregressive (AR) language model pretraining, focusing on data-constrained, compute-abundant settings where available compute outpaces the supply of high-quality text. Standard AR pretraining overfits severely in this regime: validation loss reaches an early optimum and then deteriorates. The study introduces three orthogonal augmentation categories: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction (predicting $x_{t+i}$ for $i > 1$ rather than the next token). Systematic ablations show that each augmentation on its own delays overfitting and lowers validation loss, with random token replacement reaching the lowest minimum validation loss among them. Combining the categories reduces minimum validation loss further, demonstrating that they improve data efficiency. All code and data are publicly available.
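As a rough illustration of target offset prediction, the two objectives can be written side by side. The notation here is an assumption of this summary ($T$ is the sequence length, $p_\theta$ the model), and the summary does not state whether the offset loss replaces the standard next-token loss or is added to it.

$$
\mathcal{L}_{\text{AR}}(\theta) = -\sum_{t=1}^{T-1} \log p_\theta\left(x_{t+1} \mid x_{\le t}\right),
\qquad
\mathcal{L}_{i}(\theta) = -\sum_{t=1}^{T-i} \log p_\theta\left(x_{t+i} \mid x_{\le t}\right), \quad i > 1.
$$

The only change is the target index: the context $x_{\le t}$ is the same, but the model is asked for a token further ahead.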
Key takeaway
If you are a machine learning engineer training large language models with limited high-quality data, integrate data augmentations into your pretraining pipeline. Techniques such as random token replacement, or combinations of different augmentation categories, can mitigate overfitting and enable productive multi-epoch training on a fixed corpus. This improves data efficiency, letting you extract more value from existing datasets and reach lower validation loss.
Key insights
Data augmentations effectively mitigate overfitting in data-constrained language model pretraining, enabling productive multi-epoch training.
Principles
- Overfitting is severe in data-constrained AR pretraining.
- Orthogonal augmentation categories can be combined for better results.
- Data augmentations improve data efficiency for LMs.
Method
The method applies token-level noise, sequence permutations, and target offset prediction during autoregressive pretraining. These augmentations regularize the model, allowing for hundreds of epochs on fixed corpora.
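A minimal sketch of the token-level noise category, written in plain Python so it runs without a training framework. The function names, the `rate` parameter, and the choice to corrupt only the inputs while keeping the targets clean are assumptions made for illustration; the summary does not specify how the original work applies the noise.

```python
import random


def replace_random_tokens(tokens, vocab_size, rate, rng):
    """Copy `tokens`, swapping each position with probability `rate`
    for a token id drawn uniformly from [0, vocab_size)."""
    return [rng.randrange(vocab_size) if rng.random() < rate else tok
            for tok in tokens]


def mask_tokens(tokens, mask_id, rate, rng):
    """Copy `tokens`, replacing each position with `mask_id`
    with probability `rate`."""
    return [mask_id if rng.random() < rate else tok for tok in tokens]


def make_noised_example(tokens, vocab_size, rate, rng):
    """Build (inputs, targets) for next-token prediction.

    Assumption: only the inputs are corrupted; the targets are the
    clean tokens shifted by one position.
    """
    inputs = replace_random_tokens(tokens[:-1], vocab_size, rate, rng)
    targets = list(tokens[1:])
    return inputs, targets


if __name__ == "__main__":
    rng = random.Random(0)
    sequence = [5, 6, 7, 8, 9]
    inputs, targets = make_noised_example(sequence, vocab_size=100, rate=0.3, rng=rng)
    print("inputs: ", inputs)
    print("targets:", targets)
```

Because the noise is re-sampled on every call, the same sequence yields a different corrupted input each epoch, which is one plausible reading of how such noise acts as a regularizer on a fixed corpus.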
In practice
- Implement random token replacement for AR pretraining.
- Combine different augmentation types for improved loss.
- Utilize multi-epoch training on fixed datasets.
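To make the multi-epoch point concrete, the sketch below shows how sequence permutations could give a model a different view of the same document on each pass. It is an illustration under stated assumptions, not the original implementation: the sentinel ids `PRE`, `SUF`, `MID` are hypothetical, the Fill-in-the-Middle layout follows the common prefix-suffix-middle ordering, and choosing uniformly among the three orderings is a simplification.

```python
import random

# Hypothetical sentinel ids; a real tokenizer would reserve vocabulary entries.
PRE, SUF, MID = -1, -2, -3


def right_to_left(tokens):
    """Reverse the sequence so next-token prediction runs right to left."""
    return list(reversed(tokens))


def fill_in_the_middle(tokens, rng):
    """Cut the sequence at two random points and reorder it as
    PRE prefix SUF suffix MID middle."""
    i, j = sorted(rng.randrange(len(tokens) + 1) for _ in range(2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return [PRE, *prefix, SUF, *suffix, MID, *middle]


def epoch_view(tokens, rng):
    """Sample one ordering of the same tokens for the current epoch.

    Assumption: the three orderings are chosen with equal probability.
    """
    kind = rng.choice(["left_to_right", "right_to_left", "fim"])
    if kind == "right_to_left":
        return kind, right_to_left(tokens)
    if kind == "fim":
        return kind, fill_in_the_middle(tokens, rng)
    return kind, list(tokens)


if __name__ == "__main__":
    rng = random.Random(0)
    document = [10, 11, 12, 13, 14, 15]
    for epoch in range(3):
        kind, view = epoch_view(document, rng)
        print(epoch, kind, view)
```

Token-level noise could be applied to each view afterwards, which is one way to combine augmentation categories; the summary does not say how the original work composes them.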
Topics
- Data Augmentation
- Language Model Pretraining
- Autoregressive Models
- Overfitting Mitigation
- Token-level Noise
- Data-Constrained Training
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.