Characterizing Narrative Content in Web-scale LLM Pretraining Data

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Expert, quick

Summary

A study by Johnson et al., submitted on June 17, 2026, presents the first fine-grained analysis of narrative features in Dolma, an open pretraining corpus of 3 trillion tokens. The researchers developed a framework derived from narrative theory that covers three core elements (agency, setting, and events), operationalized into 11 interpretable dimensions. After annotating 400 diverse passages, they fine-tuned NarraBERT, a RoBERTa-based model, for narrative prediction. They then applied the model to 3 million passages to produce the new NarraDolma dataset. Two findings stand out. First, narrative structure is measurable at scale across varied web data, where it appears as a continuous, multidimensional property. Second, narrative qualities are unevenly distributed across pretraining sources and topics, a factor that existing curation practices neither measure nor account for. The framework, the NarraDolma dataset, and the NarraBERT model are publicly released.

Key takeaway

If you curate pretraining data for large language models, recognize that narrative content is unevenly distributed across sources. This imbalance, now measurable with NarraBERT and NarraDolma, may affect your model's narrative reasoning capabilities. Consider integrating narrative-aware metrics into your data selection and weighting processes as a step toward more balanced and robust LLM performance on complex tasks.
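One way to picture narrative-aware weighting is the minimal sketch below. It is not the method from the paper: the source names, the base mixture weights, the per-source narrative scores, and the exponential tilt are all assumptions made for illustration. The idea is simply to multiply each source's sampling weight by a factor that grows with its mean narrative score, then renormalize.

```python
import math

def tilt_source_weights(base_weights, narrative_scores, strength=1.0):
    """Return sampling weights tilted toward narrative-rich sources.

    base_weights:     source name -> current sampling weight
    narrative_scores: source name -> mean narrative score (assumed in [0, 1])
    strength:         0 leaves the mixture unchanged; larger values tilt harder
    """
    if set(base_weights) != set(narrative_scores):
        raise ValueError("both mappings must cover the same sources")
    # Exponential tilt: weight * exp(strength * score), then renormalize to sum to 1.
    tilted = {
        src: weight * math.exp(strength * narrative_scores[src])
        for src, weight in base_weights.items()
    }
    total = sum(tilted.values())
    return {src: value / total for src, value in tilted.items()}

def mixture_score(weights, narrative_scores):
    """Weighted mean narrative score of the whole mixture."""
    return sum(weights[src] * narrative_scores[src] for src in weights)

# Hypothetical numbers, for illustration only.
base = {"web_crawl": 0.7, "code": 0.2, "fiction": 0.1}
scores = {"web_crawl": 0.30, "code": 0.05, "fiction": 0.90}

tilted = tilt_source_weights(base, scores, strength=2.0)
```

Comparing `mixture_score(base, scores)` with `mixture_score(tilted, scores)` shows how far a given `strength` moves the overall narrative content of the mix. In a real pipeline the scores would come from a measured dataset rather than hand-picked values.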

Key insights

Narrative content in LLM pretraining data is measurable and unevenly distributed, which may affect model capabilities.

Principles

Narrative structure is quantifiable at scale.
Web text has a continuous narrative dimension.
Data curation overlooks narrative distribution.

Method

A framework covering three core elements (agency, setting, events), operationalized into 11 dimensions, was used to annotate 400 passages. NarraBERT, a RoBERTa-based model, was fine-tuned on these annotations and applied to 3M passages to create NarraDolma.

In practice

Use NarraBERT for narrative prediction.
Analyze NarraDolma for data composition.
Inform LLM pretraining data curation.
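A composition analysis of this kind can be sketched as a per-source average of per-passage scores. The record layout below is hypothetical and is not the published NarraDolma schema. The three score keys stand in for the framework's core elements, since the 11 dimension names are not listed here, and the values are made up.

```python
from collections import defaultdict
from statistics import mean

def mean_scores_by_source(records):
    """Average each narrative score per source.

    records: iterable of dicts with a "source" string and a "scores" dict
             mapping dimension name -> float (hypothetical layout).
    Returns: source -> {dimension -> mean score}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for record in records:
        for dimension, value in record["scores"].items():
            grouped[record["source"]][dimension].append(value)
    return {
        source: {dim: mean(values) for dim, values in dims.items()}
        for source, dims in grouped.items()
    }

# Made-up passages, for illustration only.
records = [
    {"source": "forum", "scores": {"agency": 0.8, "setting": 0.4, "events": 0.6}},
    {"source": "forum", "scores": {"agency": 0.6, "setting": 0.2, "events": 0.8}},
    {"source": "docs",  "scores": {"agency": 0.1, "setting": 0.1, "events": 0.2}},
]

summary = mean_scores_by_source(records)
```

The resulting `summary` maps each source to its mean score per dimension, which makes uneven distributions across sources easy to spot before any selection or weighting decision is made.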

Topics

LLM Pretraining Data
Narrative Theory
NarraBERT
Dolma Corpus
Data Curation
RoBERTa Model

Best for: AI Engineer, Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.