How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data
Summary
A systematic study investigated methods for synthesizing high-quality pretraining data for large language models, generating over one trillion tokens to compare rephrasing strategies, generator models, and source data. The research found that structured output formats like tables, math problems, FAQs, and tutorials consistently produced better synthetic data than curated web baselines and previous synthetic methods. Interestingly, generator models larger than 1 billion parameters offered no further performance gains. The choice of original data for mixing also significantly impacted performance. Applying these findings, the researchers developed "FinePhrase", a 486-billion-token open dataset of rephrased web text, which outperformed existing synthetic data baselines and reduced generation costs by up to 30 times.
Key takeaway
AI engineers and research scientists developing large language models should prioritize generating synthetic pretraining data in structured output formats such as tables or FAQs. Combined with careful selection of the original source data, this approach can significantly improve model performance and reduce data generation costs by up to 30 times, as demonstrated by the "FinePhrase" dataset.
Key insights
Structured output formats and careful source data selection are key to high-quality synthetic pretraining data.
Principles
- Structured outputs enhance synthetic data quality.
- Generator models larger than 1B parameters offered no further performance gains.
- Source data selection critically impacts performance.
Method
The study involved extensive controlled experiments comparing rephrasing strategies, generator models, and source data to identify critical factors in synthesizing pretraining data.
In practice
- Prioritize structured formats for synthetic data generation.
- Keep the generator model small for cost-efficiency; models larger than 1B parameters brought no further gains in the study.
- Carefully select original data for mixing.
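The first and third points above can be illustrated with a minimal sketch. It is not the study's pipeline: the prompt wording in `FORMAT_INSTRUCTIONS`, the helper names `build_rephrase_prompt` and `mix_datasets`, and the `synthetic_fraction` parameter are all assumptions made for illustration. The sketch only shows how a rephrasing prompt for a structured format might be assembled, and how rephrased documents might be mixed with original data at a chosen ratio.

```python
import random

# Hypothetical instructions for the structured formats named in the summary.
# These are illustrative wordings, not the prompts used in the study.
FORMAT_INSTRUCTIONS = {
    "table": "Rewrite the key facts of the text as a table, then explain it briefly.",
    "math": "Turn the text into a math word problem with a step-by-step solution.",
    "faq": "Rewrite the text as a list of frequently asked questions with answers.",
    "tutorial": "Rewrite the text as a step-by-step tutorial.",
}


def build_rephrase_prompt(source_text: str, output_format: str) -> str:
    """Assemble a rephrasing prompt for one source document (assumed layout)."""
    if output_format not in FORMAT_INSTRUCTIONS:
        raise ValueError(f"unknown output format: {output_format!r}")
    instruction = FORMAT_INSTRUCTIONS[output_format]
    return f"{instruction}\n\nText:\n{source_text}\n\nRewritten text:"


def mix_datasets(original, synthetic, synthetic_fraction, seed=0):
    """Keep all synthetic documents and sample enough original documents
    so that about `synthetic_fraction` of the result is synthetic."""
    if not 0 < synthetic_fraction <= 1:
        raise ValueError("synthetic_fraction must be in (0, 1]")
    rng = random.Random(seed)
    wanted = round(len(synthetic) * (1 - synthetic_fraction) / synthetic_fraction)
    n_original = min(wanted, len(original))  # cap at what is available
    mixed = list(synthetic) + rng.sample(list(original), n_original)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    print(build_rephrase_prompt("Water boils at 100 C at sea level.", "faq"))
    docs = mix_datasets(["web doc"] * 10, ["rephrased doc"] * 5, synthetic_fraction=0.5)
    print(len(docs), "documents in the mix")
```

In a real setup the prompt would be sent to a generator model, and the mixing ratio and the choice of original dataset would be tuned empirically, since the summary reports that the original data used for mixing significantly affects performance.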
Topics
- Synthetic Data Generation
- Pretraining Data
- Prompt Design
- Generator Model Size
- Structured Output Formats
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.