Probing for Reading Times
Summary
This study investigates whether language model (LM) representations predict human reading times, comparing them against scalar predictors: surprisal, information value, and logit-lens surprisal. The researchers fit regularized linear regressions on two eye-tracking corpora, Provo and MECO, covering five languages: English, Greek, Hebrew, Russian, and Turkish. They evaluated mGPT (24 layers, 2048 embedding dimension), GPT-2 Small (12 layers, 768 embedding dimension), and cosmosGPT (12 layers, 768 embedding dimension). Early-layer representations (layers 1-10 for mGPT) predict early-pass measures such as first fixation and gaze duration better than the scalar predictors do. Scalar surprisal, in contrast, remains more effective for late-pass measures such as total reading time. Predictor performance also varies considerably across languages, and combining surprisal with early-layer representations often improves predictive power.
Key takeaway
For NLP engineers building psycholinguistic models: early-layer LM representations predict early-pass reading measures (first fixation, gaze duration) better than scalar surprisal, while surprisal remains the stronger predictor for late-pass measures such as total reading time. Consider combining both predictor types and tuning models for the target language and eye-tracking measure.
Key insights
Early LM layers predict early-pass reading measures better than scalar surprisal, which performs better on late-pass measures.
Principles
- Model depth aligns with temporal stages of human reading.
- Scalar compression discards psychometrically relevant information.
- Predictor effectiveness varies by language and eye-tracking measure.
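The second principle can be illustrated with a toy example. Surprisal is the negative log probability of the observed word, so it keeps a single number from the model's next-word distribution. The sketch below uses two hypothetical distributions over a four-word vocabulary (the numbers are invented for illustration, not taken from the study) that assign the same probability to the observed word. Their surprisal is identical even though the distributions differ, which is one simple way a scalar can hide information. Entropy is used here only as a convenient way to show the difference; it is not one of the predictors named above.

```python
import math


def surprisal_bits(probs, target):
    """Surprisal of the observed word: -log2 p(target)."""
    return -math.log2(probs[target])


def entropy_bits(probs):
    """Shannon entropy of the whole next-word distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical next-word distributions over a 4-word toy vocabulary.
peaked = [0.25, 0.73, 0.01, 0.01]  # the model strongly prefers word 1
flat = [0.25, 0.25, 0.25, 0.25]    # the model is maximally uncertain
target = 0                          # index of the word that actually occurred

# Same surprisal for the observed word in both cases ...
s_peaked = surprisal_bits(peaked, target)
s_flat = surprisal_bits(flat, target)

# ... but the distributions themselves are very different.
h_peaked = entropy_bits(peaked)
h_flat = entropy_bits(flat)

print(f"surprisal: peaked={s_peaked:.2f} bits, flat={s_flat:.2f} bits")
print(f"entropy:   peaked={h_peaked:.2f} bits, flat={h_flat:.2f} bits")
```

A regression on surprisal alone cannot tell these two contexts apart, whereas a regression on a richer vector-valued input could in principle.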
Method
Regularized linear regression predicts unit-level reading times from LM representations and scalar baselines (surprisal, information value, logit-lens surprisal) across multiple layers and languages, using 10-fold cross-validation.
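The method can be sketched as ridge regression scored by 10-fold cross-validation. This is a minimal illustration under stated assumptions, not the authors' implementation: ridge (an L2 penalty) is assumed as the form of regularization, the data are synthetic, and the names `fit_ridge`, `cv_mse`, and `alpha` are hypothetical. It uses only the standard library to stay dependency-free; in practice a library such as scikit-learn (`Ridge`, `KFold`) would typically be used instead.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ridge(X, y, alpha):
    """Closed-form ridge fit; centering keeps the intercept unpenalized."""
    n, d = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - mu[j] for j in range(d)] for row in X]
    yc = [v - y_mean for v in y]
    # Normal equations with the L2 penalty added on the diagonal.
    gram = [[sum(r[i] * r[j] for r in Xc) + (alpha if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(d)]
    w = solve(gram, rhs)
    b = y_mean - sum(wi * mi for wi, mi in zip(w, mu))
    return w, b


def predict(X, w, b):
    return [sum(wi * xi for wi, xi in zip(w, row)) + b for row in X]


def cv_mse(X, y, alpha, k=10, seed=0):
    """Mean held-out squared error over k shuffled folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for held_out in folds:
        test = set(held_out)
        train = [i for i in idx if i not in test]
        w, b = fit_ridge([X[i] for i in train], [y[i] for i in train], alpha)
        preds = predict([X[i] for i in held_out], w, b)
        errors.append(sum((p - y[i]) ** 2 for p, i in zip(preds, held_out)) / len(held_out))
    return sum(errors) / k


# Synthetic stand-in data: 200 units, 4 features, reading time in milliseconds.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [250 + 30 * row[0] - 10 * row[2] + rng.gauss(0, 5) for row in X]

mse_features = cv_mse(X, y, alpha=1.0)
mse_baseline = cv_mse([[0.0] for _ in X], y, alpha=1.0)  # intercept-only baseline
print(f"held-out MSE with features: {mse_features:.1f}, intercept only: {mse_baseline:.1f}")
```

To compare predictors in this setup, `X` would hold either one layer's representation per unit or a single scalar column such as surprisal, and the held-out error of each would be compared on the same folds.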
In practice
- Use early LM layers for initial processing predictions.
- Combine representations with surprisal for improved performance.
- Consider language-specific model tuning for reading time tasks.
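One simple way to combine the two predictor types is to append surprisal as an extra feature column next to each unit's layer representation before fitting the regression. The sketch below assumes that approach; the summary does not say how the study combined them. The z-scoring step, the function names, and the tiny three-dimensional vectors are all illustrative assumptions.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a list of scalars; a constant column maps to zeros."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]


def combine_features(layer_vectors, surprisals):
    """Append z-scored surprisal as one extra column to each unit's layer vector."""
    if len(layer_vectors) != len(surprisals):
        raise ValueError("one surprisal value is needed per unit")
    z = standardize(surprisals)
    return [list(vec) + [s] for vec, s in zip(layer_vectors, z)]


# Hypothetical early-layer vectors (dimension 3) and surprisals (bits) for 3 units.
layer_vectors = [[0.1, -0.3, 0.7], [0.4, 0.2, -0.1], [-0.5, 0.9, 0.0]]
surprisals = [2.0, 4.0, 6.0]

combined = combine_features(layer_vectors, surprisals)
print(len(combined), "units,", len(combined[0]), "features each")
```

The combined matrix can then be passed to the same regularized regression used for either predictor alone, so all three variants are scored under identical conditions.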
Topics
- Language Model Representations
- Human Reading Times
- Eye-tracking Corpora
- Surprisal Theory
- Logit Lens
Best for: NLP Engineer, AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.