How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Expert, extended

Summary

Hugging Face researchers conducted a systematic study of how to synthesize high-quality pretraining data for large language models (LLMs), generating over one trillion tokens to compare rephrasing strategies, generator models, and source data. They found that structured output formats such as tables, math problems, FAQs, and tutorials consistently outperform existing synthetic methods and curated web baselines. They also found that scaling the generator model beyond 1 billion parameters yields little additional benefit, and that the choice of original "mix-in" data significantly affects performance. Applying these insights, the team built FinePhrase, a 486-billion-token open dataset of rephrased web text that surpasses all current synthetic data baselines while cutting generation costs by up to 30 times. The dataset, prompts, and generation framework are openly available.

Key takeaway

AI engineers and research scientists developing LLMs should focus on crafting diverse, structured rephrasing prompts rather than relying on larger generator models. According to the study, combining synthetic data in pedagogical formats (such as math problems or tables) with high-quality original web text improves model performance and substantially reduces compute costs, making pretraining pipelines more efficient.

Key insights

Structured pedagogical formats and diverse outputs, not larger generator models, are the key to high-quality synthetic pretraining data.

Principles

Rephrasing prompt design is the dominant factor for downstream performance.
Generator models beyond 1B parameters offer negligible gains for most rephrasing tasks.
Output diversity is more critical than formatting consistency.
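
To make the first and third principles concrete, here is a minimal Python sketch of format-specific rephrasing prompts spread evenly across documents. The template wording, the `TEMPLATES` dictionary, and the helper names are illustrative assumptions; they are not the prompts released with FinePhrase.

```python
import random

# Hypothetical prompt templates, one per structured format named in the summary.
# The wording is an assumption and does not reproduce the released prompts.
TEMPLATES = {
    "faq": "Rewrite the text below as a list of questions and answers.\n\nText:\n{document}",
    "math": "Rewrite the text below as math word problems with worked solutions.\n\nText:\n{document}",
    "table": "Rewrite the text below as one or more tables with a short explanation.\n\nText:\n{document}",
    "tutorial": "Rewrite the text below as a step-by-step tutorial.\n\nText:\n{document}",
}


def build_prompt(document: str, fmt: str) -> str:
    """Fill the template for one format with the source document."""
    return TEMPLATES[fmt].format(document=document)


def assign_formats(num_docs: int, seed: int = 0) -> list[str]:
    """Assign formats round-robin, then shuffle, so every format is used
    about equally often and the output stays diverse."""
    formats = sorted(TEMPLATES)
    assigned = [formats[i % len(formats)] for i in range(num_docs)]
    random.Random(seed).shuffle(assigned)
    return assigned
```

Balancing formats per batch is only one way to encourage diversity; sampling formats with weights would work as well.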

Method

Systematic evaluation of synthetic data generation by ablating rephrasing strategy, generator model scale/architecture, and source/mix-in data composition, validated by training 1.2B-parameter LLMs for 21B tokens.
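
The ablation setup can be pictured as a baseline configuration with one factor changed at a time. The sketch below assumes that one-factor-at-a-time design, and the axis values (`gen-small`, `web-low-quality`, and so on) are placeholders rather than the models or corpora used in the study. Only the 1.2B-parameter model size and the 21B-token budget come from the summary above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Run:
    prompt: str
    generator: str
    source: str
    student_params: float = 1.2e9  # evaluation model size, from the summary
    train_tokens: float = 21e9     # training budget per run, from the summary


# Placeholder axis values; the study's real choices are not listed here.
AXES = {
    "prompt": ["faq", "math", "table", "tutorial"],
    "generator": ["gen-small", "gen-1b", "gen-large"],
    "source": ["web-high-quality", "web-low-quality"],
}

BASELINE = Run(prompt="faq", generator="gen-1b", source="web-high-quality")


def one_factor_ablations(baseline: Run, axes: dict[str, list[str]]) -> list[Run]:
    """Return the baseline plus every run that differs from it on exactly one axis."""
    runs = [baseline]
    for axis, values in axes.items():
        for value in values:
            if getattr(baseline, axis) != value:
                runs.append(replace(baseline, **{axis: value}))
    return runs


def tokens_per_param(run: Run) -> float:
    """Training tokens seen per model parameter."""
    return run.train_tokens / run.student_params


runs = one_factor_ablations(BASELINE, AXES)
```

Holding the evaluation model and token budget fixed across runs is what lets differences in downstream scores be attributed to the data rather than to training compute.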

In practice

Prioritize structured pedagogical prompts (e.g., math, table, FAQ, tutorial).
Use smaller, efficient generator models (e.g., 1B-parameter class) for cost-effective synthesis.
Always mix synthetic data with original web text to preserve natural language understanding (NLU) capabilities.
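
The mixing step in the last item can be sketched as a simple stream interleaver. This is an illustration under assumptions: the function name `mix_streams` and the default `synthetic_fraction` of 0.5 are hypothetical, and the summary does not state what ratio the study used.

```python
import random
from typing import Iterable, Iterator


def mix_streams(
    synthetic: Iterable[str],
    original: Iterable[str],
    synthetic_fraction: float = 0.5,  # hypothetical default, not from the study
    seed: int = 0,
) -> Iterator[str]:
    """Interleave two document streams, drawing from the synthetic stream
    with probability `synthetic_fraction`. Stops as soon as the stream
    chosen for the next draw is exhausted."""
    rng = random.Random(seed)
    syn, org = iter(synthetic), iter(original)
    while True:
        stream = syn if rng.random() < synthetic_fraction else org
        try:
            yield next(stream)
        except StopIteration:
            return


# Usage: blend rephrased documents with their original web-text counterparts.
mixed = list(mix_streams(["s1", "s2", "s3"], ["o1", "o2", "o3"]))
```

Because the generator stops at the first exhausted stream, the realized mix stays close to the requested fraction instead of ending with a long tail drawn from a single source.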

Topics

Synthetic Data Generation
LLM Pretraining Data
Prompt Design Strategies
Generator Model Scale
FinePhrase Dataset

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.