Characterizing Narrative Content in Web-scale LLM Pretraining Data
Summary
A new study characterizes narrative content in web-scale LLM pretraining corpora, an area that remains underexplored even though narrative is a fundamental mode of human communication. Focusing on Dolma, a 3-trillion-token open pretraining corpus, the researchers developed a framework grounded in narrative theory: three core elements (agency, setting, events) operationalized into 11 interpretable dimensions. After annotating 400 diverse passages, they finetuned NarraBERT, a RoBERTa-based model for fine-grained narrative prediction, and applied it to 3 million passages to create NarraDolma. Three findings stand out: narrative structure is measurable at scale across heterogeneous data; web text exhibits a continuous, multidimensional narrative structure; and narrative qualities are unequally distributed across pretraining sources and topics, a disparity that current curation practices do not account for.
Key takeaway
For NLP engineers who curate LLM pretraining data, this study shows that narrative qualities are unevenly distributed across sources and topics, a factor current practices do not measure. Consider integrating fine-grained narrative analysis, using tools like NarraBERT or the proposed framework, to understand and potentially balance the narrative composition of your training corpora. This could improve model performance on narrative reasoning tasks.
Key insights
Narrative structure in web-scale LLM pretraining data is measurable and unevenly distributed, which may affect how models perform on narrative reasoning tasks.
Principles
- Narrative features are measurable at web scale.
- Web text has continuous, multidimensional narrative structure.
- Current data curation overlooks narrative quality distribution.
Method
Design a framework with 3 narrative elements (agency, setting, events) and 11 dimensions. Annotate passages, finetune a RoBERTa-based model (NarraBERT), then apply it at scale to create a new dataset.
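The "apply it at scale" step can be pictured as a batched scoring loop. The following is a minimal sketch, not the authors' implementation: the `Scorer` type, `score_corpus`, and `dummy_scorer` are hypothetical names, and the dummy scorer is a stand-in for a finetuned model that returns one value per dimension. Only the count of 11 dimensions comes from the study.

```python
from typing import Callable, Iterable, Iterator

N_DIMENSIONS = 11  # from the study: 11 interpretable narrative dimensions

# Hypothetical interface: a scorer maps a batch of passages to one
# score vector (length N_DIMENSIONS) per passage.
Scorer = Callable[[list[str]], list[list[float]]]


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive batches of at most `size` items."""
    batch: list[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def score_corpus(
    passages: Iterable[str], scorer: Scorer, batch_size: int = 2
) -> Iterator[dict]:
    """Score passages batch by batch and yield one record per passage."""
    for batch in batched(passages, batch_size):
        vectors = scorer(batch)
        if len(vectors) != len(batch):
            raise ValueError("scorer must return one vector per passage")
        for passage, vector in zip(batch, vectors):
            if len(vector) != N_DIMENSIONS:
                raise ValueError(f"expected {N_DIMENSIONS} scores per passage")
            yield {"text": passage, "scores": vector}


def dummy_scorer(batch: list[str]) -> list[list[float]]:
    """Placeholder for a finetuned model: a length-based value repeated
    for every dimension. It carries no narrative meaning."""
    return [[min(len(text) / 100.0, 1.0)] * N_DIMENSIONS for text in batch]


records = list(
    score_corpus(
        ["She opened the door.", "Price: $5.", "They left at dawn."],
        dummy_scorer,
    )
)
```

Streaming records from a generator keeps memory use flat regardless of corpus size, which matters when the input is millions of passages. A real scorer would replace `dummy_scorer` without changing the loop.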
In practice
- Use NarraBERT for fine-grained narrative prediction.
- Analyze NarraDolma to understand narrative distribution.
- Inform LLM data curation with narrative insights.
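One way to act on the points above is to compare average narrative scores across data sources. The sketch below assumes a hypothetical record layout (a `source` field plus one score per narrative element); the source names and values are made up for illustration and are not taken from NarraDolma, which scores 11 finer-grained dimensions rather than three elements.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: source names and scores are illustrative only.
scored_passages = [
    {"source": "source_a", "scores": {"agency": 1.0, "setting": 0.5, "events": 0.75}},
    {"source": "source_a", "scores": {"agency": 0.5, "setting": 0.5, "events": 0.25}},
    {"source": "source_b", "scores": {"agency": 0.0, "setting": 0.25, "events": 0.0}},
]


def mean_scores_by_source(records):
    """Return {source: {dimension: mean score}} for a list of scored records."""
    grouped = defaultdict(lambda: defaultdict(list))
    for record in records:
        for dimension, value in record["scores"].items():
            grouped[record["source"]][dimension].append(value)
    return {
        source: {dimension: mean(values) for dimension, values in dims.items()}
        for source, dims in grouped.items()
    }


summary = mean_scores_by_source(scored_passages)
for source, dims in sorted(summary.items()):
    print(source, dims)
```

Large gaps between sources in a table like this would be the kind of imbalance that a curation step could then weigh or rebalance.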
Topics
- LLM Pretraining Data
- Narrative Analysis
- Data Curation
- NarraBERT
- Dolma Corpus
- Computational Linguistics
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.