Modeling semantic association in self-paced reading with language model embeddings
Summary
This study investigated how semantic association, quantified with language model (LM) embeddings, affects N400 brain potentials and reading times in self-paced reading. Researchers used a corpus of joint electroencephalography (EEG) and self-paced reading data from 56 participants reading natural Dutch texts from the Tilburg corpus (TiNT). Ten different implementations of semantic association were tested, varying the embedding model (uncontextualized "wikipedia2vec_nlwiki_20180420_300d" word embeddings and contextualized "e5-large-trm-nl" sentence embeddings) and the context length. Bayesian hierarchical models and Bayes factors showed that the choice of embedding model substantially changes the estimated effect of semantic association. Specifically, sentence embeddings showed reliable effects on both neural and behavioral measures, whereas word embeddings did not, highlighting the importance of methodological choices.
Key takeaway
For research scientists modeling language processing, carefully consider your choice of embedding model when quantifying semantic association. This study indicates that contextualized sentence embeddings, like "e5-large-trm-nl", are more effective than uncontextualized word embeddings for reliably predicting N400 and reading times in naturalistic text. You should prioritize sentence embeddings and explore varying context window definitions to accurately capture semantic effects beyond word predictability.
Key insights
Methodological choices, especially embedding model type, critically impact semantic association quantification in reading comprehension.
Principles
- Sentence embeddings capture semantic association more reliably than word embeddings in naturalistic reading.
- The choice of embedding model can reverse the direction of estimated N400 effects.
- Context length influences semantic association effects, particularly with sentence embeddings.
Method
Semantic association is quantified as cosine similarity between a critical word's embedding and its context's embedding, varying embedding models and context lengths.
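As a rough illustration of the cosine-similarity measure described above, the following Python sketch computes semantic association from toy vectors. The function names, the 3-dimensional toy vectors, and the choice to average context word vectors into a single context embedding are assumptions made for this example; the summary does not state how the study builds the context embedding.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return 0.0
    return float(np.dot(u, v) / denom)


def semantic_association(word_vec, context_vecs):
    """Similarity between a critical word and its context.

    Assumption for this sketch: the context embedding is the mean of the
    context word vectors. A sentence-embedding model would instead return
    one vector for the whole context string.
    """
    context_vec = np.mean(np.asarray(context_vecs, dtype=float), axis=0)
    return cosine_similarity(word_vec, context_vec)


# Toy 3-dimensional vectors (hypothetical values, not real embeddings).
toy = {
    "koffie": [0.9, 0.1, 0.0],
    "drinken": [0.8, 0.2, 0.1],
    "fiets": [0.0, 0.1, 0.9],
}

related = semantic_association(toy["koffie"], [toy["drinken"]])
unrelated = semantic_association(toy["koffie"], [toy["fiets"]])
print(f"related={related:.3f} unrelated={unrelated:.3f}")
```

With real data, the toy dictionary would be replaced by vectors from the chosen embedding model, and the resulting score would enter the statistical model as a word-level predictor.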
In practice
- Prioritize contextualized sentence embeddings over uncontextualized word embeddings for semantic association tasks.
- Experiment with different context lengths when using sentence embeddings to optimize effect capture.
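One way to experiment with context length is to extract several preceding-word windows for each critical word and score each one separately. The sketch below assumes context length is counted in preceding words, which the summary does not specify; the helper name and the example sentence are hypothetical.

```python
def context_windows(tokens, index, lengths):
    """Map each window length to the tokens preceding the word at `index`.

    Windows are clipped at the start of the text, so a requested length
    larger than the available context returns everything before the word.
    """
    windows = {}
    for n in lengths:
        start = max(0, index - n)
        windows[n] = tokens[start:index]
    return windows


tokens = "de man dronk gisteren een kop koffie".split()
critical_index = 6  # position of "koffie"

windows = context_windows(tokens, critical_index, lengths=[1, 3, 10])
for n, context in windows.items():
    print(n, " ".join(context))
```

Each window can then be embedded and compared with the critical word, giving one semantic association value per context length.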
Topics
- Semantic Association
- Self-Paced Reading
- Electroencephalography
- N400 ERP Component
- Language Model Embeddings
- Sentence Embeddings
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.