Characterizing Narrative Content in Web-scale LLM Pretraining Data
Summary
A study by Johnson et al., submitted on June 17, 2026, presents the first fine-grained analysis of narrative features within the Dolma 3-trillion-token open pretraining corpus. The researchers developed a framework derived from narrative theory that covers three core elements (agency, setting, and events), operationalized into 11 interpretable dimensions. After annotating 400 diverse passages, they fine-tuned NarraBERT, a RoBERTa-based model, for narrative prediction. The model was then applied to 3 million passages to produce the new NarraDolma dataset. Key findings indicate that narrative structure is measurable at scale across varied web data and that this structure is continuous and multidimensional. Furthermore, narrative qualities are unevenly distributed across pretraining sources and topics, a factor that existing curation practices neither measure nor account for. The framework, the NarraDolma dataset, and the NarraBERT model are publicly released.
Key takeaway
Machine learning engineers curating pretraining data for large language models should recognize that narrative content is unevenly distributed across sources. This imbalance, now measurable with NarraBERT and NarraDolma, may affect a model's narrative reasoning capabilities. Consider integrating narrative-aware metrics into data selection and weighting as one way to work toward more balanced and robust LLM performance on complex tasks.
Key insights
Narrative content in LLM pretraining data is measurable and unevenly distributed, which may affect model capabilities.
Principles
- Narrative structure is quantifiable at scale.
- Narrative structure in web text is continuous and multidimensional.
- Current data curation practices do not measure narrative distribution.
Method
A framework of three core elements (agency, setting, and events), operationalized into 11 dimensions, was used to annotate 400 passages. NarraBERT, a RoBERTa-based model, was fine-tuned on these annotations and applied to 3 million passages to create NarraDolma.
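The output of such a pipeline can be pictured as one record per passage carrying 11 continuous scores. The sketch below only illustrates that shape. The `PassageScores` class, the placeholder dimension names `dim_01` to `dim_11`, and the [0, 1] score range are assumptions made for illustration, not the released NarraDolma schema, and no NarraBERT call is made.

```python
from dataclasses import dataclass
from statistics import mean

# Placeholder names: this summary does not list the 11 dimensions.
DIMENSIONS = [f"dim_{i:02d}" for i in range(1, 12)]


@dataclass
class PassageScores:
    """Hypothetical per-passage record; not the actual NarraDolma format."""
    passage_id: str
    source: str
    scores: dict  # dimension name -> float, assumed to lie in [0, 1]

    def __post_init__(self):
        # Reject records that do not cover every dimension.
        missing = set(DIMENSIONS) - set(self.scores)
        if missing:
            raise ValueError(f"missing dimensions: {sorted(missing)}")

    def overall(self) -> float:
        """Unweighted mean across dimensions, one possible summary score."""
        return mean(self.scores[d] for d in DIMENSIONS)


# Toy record with made-up values.
example = PassageScores("p1", "web", {d: 0.5 for d in DIMENSIONS})
```

Keeping all 11 scores on the record, rather than only a single summary number, preserves the multidimensional view that the study emphasizes.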
In practice
- Use NarraBERT for narrative prediction.
- Analyze NarraDolma for data composition.
- Inform LLM pretraining data curation.
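A minimal sketch of how the second and third points could fit together, assuming each passage already has a single narrative score and a source label. The function names, the source names `forum` and `docs`, the scores, and the reweighting heuristic with its `alpha` parameter are all hypothetical and are not taken from the paper or from the NarraDolma release.

```python
from collections import defaultdict


def mean_score_by_source(records):
    """Map each source to the mean score of its (source, score) pairs."""
    totals = defaultdict(lambda: [0.0, 0])
    for source, score in records:
        totals[source][0] += score
        totals[source][1] += 1
    return {s: total / n for s, (total, n) in totals.items()}


def narrative_aware_weights(base_weights, source_means, alpha=1.0):
    """Tilt base sampling weights toward sources with above-average scores.

    Hypothetical heuristic: each weight is proportional to
    base * (1 + alpha * (source mean - average of source means)),
    clipped at zero and renormalized to sum to 1.
    """
    avg = sum(source_means.values()) / len(source_means)
    raw = {
        s: base_weights[s] * max(0.0, 1.0 + alpha * (source_means[s] - avg))
        for s in base_weights
    }
    total = sum(raw.values())
    if total == 0:
        raise ValueError("all adjusted weights are zero; lower alpha")
    return {s: w / total for s, w in raw.items()}


# Toy data, made up for illustration only.
records = [("forum", 0.8), ("forum", 0.6), ("docs", 0.1), ("docs", 0.3)]
means = mean_score_by_source(records)
weights = narrative_aware_weights({"forum": 0.5, "docs": 0.5}, means)
```

With this toy data the higher-scoring source ends up with a larger share of the sampling mass. Setting `alpha=0` leaves the base weights unchanged, which gives a simple baseline for comparison.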
Topics
- LLM Pretraining Data
- Narrative Theory
- NarraBERT
- Dolma Corpus
- Data Curation
- RoBERTa Model
Best for: AI Engineer, Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.