Characterizing Narrative Content in Web-scale LLM Pretraining Data

2026-06-17 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A new study characterizes narrative content in web-scale LLM pretraining corpora, an area that remains underexplored even though narrative is a fundamental mode of human communication. Focusing on Dolma, a 3-trillion-token open pretraining corpus, the researchers developed a framework grounded in narrative theory that operationalizes three core elements (agency, setting, events) as 11 interpretable dimensions. After annotating 400 diverse passages, they finetuned NarraBERT, a RoBERTa-based model for fine-grained narrative prediction, and applied it to 3 million passages to create NarraDolma. The key findings are threefold: narrative structure is measurable at scale across heterogeneous data; web text exhibits a continuous, multidimensional narrative structure; and narrative qualities are unequally distributed across pretraining sources and topics, a disparity that current curation practices do not account for.

Key takeaway

For NLP engineers working on LLM pretraining data curation, this study shows that narrative qualities are unevenly distributed across sources and topics, a factor current curation practices do not account for. Consider adding fine-grained narrative analysis, using NarraBERT or the proposed framework, to understand and potentially rebalance the narrative composition of your training corpora, which could improve model performance on narrative reasoning tasks.

Key insights

Narrative structure in web-scale LLM pretraining data is measurable and unevenly distributed across sources and topics, which may have consequences for model performance on narrative reasoning.

Principles

Narrative features are measurable at web scale.
Web text has continuous, multidimensional narrative structure.
Current data curation overlooks narrative quality distribution.

Method

Design a framework with 3 narrative elements (agency, setting, events) and 11 dimensions. Annotate passages, finetune a RoBERTa-based model (NarraBERT), then apply it at scale to create a new dataset.
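The "apply it at scale" step can be sketched as a batched scoring loop. This is a minimal illustration under stated assumptions, not the authors' pipeline: `score_corpus`, `batched`, and `dummy_predict_batch` are hypothetical names, and the stand-in predictor only mimics the output shape of a model like NarraBERT (one score per dimension, 11 per passage).

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

NUM_DIMENSIONS = 11  # from the summary: 11 interpretable dimensions


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def score_corpus(
    passages: Iterable[str],
    predict_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 2,
) -> Iterator[Dict[str, object]]:
    """Stream passages through a predictor and yield one record per passage."""
    for batch in batched(passages, batch_size):
        predictions = predict_batch(batch)
        if len(predictions) != len(batch):
            raise ValueError("predictor must return one score vector per passage")
        for text, scores in zip(batch, predictions):
            if len(scores) != NUM_DIMENSIONS:
                raise ValueError(f"expected {NUM_DIMENSIONS} scores, got {len(scores)}")
            yield {"text": text, "scores": scores}


def dummy_predict_batch(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in, NOT NarraBERT: repeats a word-count-based value."""
    return [[min(len(text.split()) / 10.0, 1.0)] * NUM_DIMENSIONS for text in batch]


results = list(score_corpus(["a b", "c d e", "f"], dummy_predict_batch, batch_size=2))
```

In a real setting, `predict_batch` would wrap the finetuned model's inference call. Streaming the passages keeps memory use flat regardless of corpus size.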

In practice

Use NarraBERT for fine-grained narrative prediction.
Analyze NarraDolma to understand narrative distribution.
Inform LLM data curation with narrative insights.
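One way to inspect how narrative qualities vary across sources is to average per-passage scores by source. The sketch below is illustrative only: the record layout, the source names, and the score values are invented toy data, and it collapses the framework's 11 dimensions to the three core elements for brevity. It does not reflect NarraDolma's actual schema.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List

# The three core elements named in the summary.
ELEMENTS = ("agency", "setting", "events")


def mean_scores_by_source(records: Iterable[dict]) -> Dict[str, Dict[str, float]]:
    """Average each element score over all passages from the same source."""
    grouped: Dict[str, Dict[str, List[float]]] = defaultdict(lambda: defaultdict(list))
    for record in records:
        for element in ELEMENTS:
            grouped[record["source"]][element].append(record[element])
    return {
        source: {element: mean(values[element]) for element in ELEMENTS}
        for source, values in grouped.items()
    }


# Toy records with made-up sources and scores (assumption, not real data).
toy_records = [
    {"source": "forum", "agency": 0.8, "setting": 0.4, "events": 0.6},
    {"source": "forum", "agency": 0.6, "setting": 0.2, "events": 0.8},
    {"source": "code", "agency": 0.0, "setting": 0.0, "events": 0.25},
]

summary = mean_scores_by_source(toy_records)
for source, scores in summary.items():
    print(source, scores)
```

A per-source table like this makes gaps visible before any rebalancing decision, for example a source that scores high on events but low on setting.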

Topics

LLM Pretraining Data
Narrative Analysis
Data Curation
NarraBERT
Dolma Corpus
Computational Linguistics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.