How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data

2026-04-15 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, quick

Summary

A systematic study investigated how to synthesize high-quality pretraining data for large language models, generating over one trillion tokens to compare rephrasing strategies, generator models, and source data. Structured output formats (tables, math problems, FAQs, and tutorials) consistently produced better synthetic data than curated web baselines and previous synthetic methods. Generator models larger than 1 billion parameters offered no further performance gains, and the choice of original data mixed with the synthetic data also significantly affected performance. Applying these findings, the researchers built "FinePhrase", a 486-billion-token open dataset of rephrased web text, which outperformed existing synthetic data baselines and reduced generation costs by up to 30 times.

Key takeaway

AI engineers and research scientists developing large language models should prioritize structured output formats, such as tables or FAQs, when generating synthetic pretraining data. Combined with careful selection of the original source data, this approach can significantly improve model performance. The "FinePhrase" dataset, built on these findings, outperformed existing synthetic data baselines while reducing generation costs by up to 30 times.

Key insights

Structured output formats and careful source data selection are key to high-quality synthetic pretraining data.

Principles

Structured outputs enhance synthetic data quality.
Generator models larger than 1B parameters offer no further performance gains.
Source data selection critically impacts performance.

Method

The study involved extensive controlled experiments comparing rephrasing strategies, generator models, and source data to identify critical factors in synthesizing pretraining data.
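One way to picture a controlled comparison over these three factors is sketched below. It is only an illustration: the strategy, model, and dataset names are placeholders, and this summary does not say whether the study ran a full grid or varied one factor at a time, so both are shown.

```python
from itertools import product

# Placeholder values: the strategies, generator models, and source datasets
# actually compared in the study are not listed in this summary.
REPHRASE_STRATEGIES = ["table", "faq", "tutorial"]
GENERATOR_MODELS = ["generator-small", "generator-1b", "generator-large"]
SOURCE_DATASETS = ["source-a", "source-b"]


def experiment_grid(strategies, generators, sources):
    """Enumerate every combination of the three factors as a config dict."""
    return [
        {"strategy": s, "generator": g, "source": d}
        for s, g, d in product(strategies, generators, sources)
    ]


def one_factor_sweep(baseline, factor, values):
    """Vary a single factor while holding the others at their baseline values."""
    if factor not in baseline:
        raise KeyError(f"unknown factor: {factor}")
    return [{**baseline, factor: value} for value in values]


grid = experiment_grid(REPHRASE_STRATEGIES, GENERATOR_MODELS, SOURCE_DATASETS)
baseline = {"strategy": "faq", "generator": "generator-1b", "source": "source-a"}
size_sweep = one_factor_sweep(baseline, "generator", GENERATOR_MODELS)
```

A one-factor sweep keeps the number of runs small and makes each difference attributable to a single change, while the full grid also exposes interactions between factors at a higher cost.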

In practice

Prioritize structured formats for synthetic data generation.
Keep the generator model small: sizes above 1B parameters add cost without further gains.
Carefully select original data for mixing.
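The practices above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's pipeline: the prompt wording, the function names, and the mixing fraction are all hypothetical, and the call to an actual generator model is left out.

```python
import random

# Hypothetical prompt templates for the four structured formats named in the
# summary; the study's actual prompts are not reproduced here.
FORMAT_PROMPTS = {
    "table": "Rewrite the following text as one or more tables with clear column headers.",
    "math": "Turn the following text into math word problems with worked solutions.",
    "faq": "Rewrite the following text as a list of frequently asked questions with answers.",
    "tutorial": "Rewrite the following text as a step-by-step tutorial.",
}


def build_rephrase_prompt(document: str, fmt: str) -> str:
    """Return a rephrasing prompt that asks for one structured output format."""
    if fmt not in FORMAT_PROMPTS:
        raise ValueError(f"unknown format: {fmt}")
    return f"{FORMAT_PROMPTS[fmt]}\n\nText:\n{document}"


def mix_datasets(original, synthetic, synthetic_fraction, n, seed=0):
    """Sample n documents (with replacement), taking round(n * synthetic_fraction)
    from the synthetic pool and the rest from the original pool."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_synthetic = round(n * synthetic_fraction)
    mixed = rng.choices(synthetic, k=n_synthetic) + rng.choices(original, k=n - n_synthetic)
    rng.shuffle(mixed)
    return mixed


prompt = build_rephrase_prompt("Water boils at 100 degrees Celsius at sea level.", "faq")
mixed = mix_datasets(["orig-1", "orig-2"], ["syn-1", "syn-2"], synthetic_fraction=0.5, n=10)
```

The mixing fraction of 0.5 is an arbitrary placeholder. The summary reports only that the choice of original data matters, not what ratio works best, so both the source pool and the ratio would need to be tuned empirically.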

Topics

Synthetic Data Generation
Pretraining Data
Prompt Design
Generator Model Size
Structured Output Formats

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.