Modeling semantic association in self-paced reading with language model embeddings

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Science & Research — Social Sciences & Behavioral Studies, Research Methodology & Innovation · Depth: Expert, extended

Summary

This study investigated how language model (LM) embeddings can be used to quantify semantic association in self-paced reading, and how the resulting measure relates to N400 brain potentials and reading times. Researchers used a corpus of joint electroencephalography (EEG) and self-paced reading data from 56 participants reading natural Dutch texts from the Tilburg corpus (TiNT). Ten different implementations of semantic association were tested, varying the embedding model (uncontextualized "wikipedia2vec_nlwiki_20180420_300d" word embeddings and contextualized "e5-large-trm-nl" sentence embeddings) and the context length. Bayesian hierarchical models and Bayes factors showed that the choice of embedding model substantially alters the estimated effect of semantic association. Specifically, sentence embeddings showed reliable effects on both neural and behavioral measures, whereas word embeddings did not, highlighting the critical role of methodological choices.

Key takeaway

For research scientists modeling language processing, carefully consider your choice of embedding model when quantifying semantic association. This study indicates that contextualized sentence embeddings, like "e5-large-trm-nl", are more effective than uncontextualized word embeddings for reliably predicting N400 and reading times in naturalistic text. You should prioritize sentence embeddings and explore varying context window definitions to accurately capture semantic effects beyond word predictability.

Key insights

Methodological choices, especially embedding model type, critically impact semantic association quantification in reading comprehension.

Principles

Sentence embeddings capture semantic association more reliably than word embeddings in naturalistic reading.
The choice of embedding model can reverse the direction of estimated N400 effects.
Context length influences semantic association effects, particularly with sentence embeddings.

Method

Semantic association is quantified as cosine similarity between a critical word's embedding and its context's embedding, varying embedding models and context lengths.
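The measure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the three-dimensional toy vectors are made up, and building the context embedding by averaging the context words' vectors is an assumption (the summary does not say how the context embedding is formed for word embeddings). The function names `cosine_similarity` and `semantic_association` are hypothetical.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return 0.0
    return float(np.dot(u, v) / denom)


def semantic_association(word_vec, context_vecs):
    """Similarity between a critical word and its context.

    Assumption: the context embedding is the mean of the context
    word vectors. Other pooling choices are possible.
    """
    context_vec = np.mean(np.asarray(context_vecs, dtype=float), axis=0)
    return cosine_similarity(word_vec, context_vec)


# Made-up toy vectors, for illustration only (not real embeddings).
toy_vectors = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.9, 0.1, 0.0]),
    "bank": np.array([0.0, 0.0, 1.0]),
}

related = semantic_association(toy_vectors["dog"], [toy_vectors["cat"]])
unrelated = semantic_association(toy_vectors["bank"], [toy_vectors["cat"]])
```

With real embeddings, `toy_vectors` would be replaced by lookups in the chosen model; the scoring step stays the same.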

In practice

Prioritize contextualized sentence embeddings over uncontextualized word embeddings for semantic association tasks.
Experiment with different context lengths when using sentence embeddings, since the estimated effect depends on how much preceding context is included.
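One way to compare context lengths is to score the same critical word against windows of different sizes. The sketch below is hedged throughout: `toy_embed` is a hypothetical stand-in for a real sentence encoder (it only hashes tokens into a small count vector), the window sizes are arbitrary, and defining the context as the preceding `k` tokens is an assumption rather than the study's definition.

```python
import numpy as np


def toy_embed(text, dim=16):
    """Hypothetical stand-in for a sentence encoder.

    Maps each token to a bucket via a simple character-sum hash and
    counts bucket hits. It carries no real semantics.
    """
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[sum(ord(c) for c in token) % dim] += 1.0
    return vec


def association_by_context_length(tokens, position, context_lengths, embed=toy_embed):
    """Cosine similarity between tokens[position] and its preceding k tokens,
    computed for each k in context_lengths."""
    target = embed(tokens[position])
    scores = {}
    for k in context_lengths:
        context = tokens[max(0, position - k):position]
        if not context:
            scores[k] = float("nan")  # no preceding context available
            continue
        ctx = embed(" ".join(context))
        denom = np.linalg.norm(target) * np.linalg.norm(ctx)
        scores[k] = float(target @ ctx / denom) if denom else 0.0
    return scores


# Made-up Dutch example sentence; the critical word is the second "kat".
tokens = "de kat zat op de mat en de kat sliep".split()
scores = association_by_context_length(tokens, position=8, context_lengths=[1, 3, 8])
```

Passing a real encoder as `embed` would give one association score per context length, which can then be entered as alternative predictors in separate statistical models.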

Topics

Semantic Association
Self-Paced Reading
Electroencephalography
N400 ERP Component
Language Model Embeddings
Sentence Embeddings

Code references

saraoe/semantic_association

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.