Probing for Reading Times

2026-04-22 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

This study investigates whether language model (LM) representations capture human reading times, comparing them against scalar predictors like surprisal, information value, and logit-lens surprisal. Researchers used regularized linear regression on two eye-tracking corpora, Provo and MECO, spanning five languages: English, Greek, Hebrew, Russian, and Turkish. They evaluated mGPT (24 layers, 2048 embedding dimension), GPT-2 Small (12 layers, 768 embedding dimension), and cosmosGPT (12 layers, 768 embedding dimension). The findings indicate that early-layer representations (layers 1-10 for mGPT) are superior for predicting early-pass measures such as first fixation and gaze duration. In contrast, scalar surprisal remains more effective for late-pass measures like total reading time. The study also observed significant cross-lingual variation in predictor performance, with combining surprisal and early-layer representations often improving predictive power.

Key takeaway

For NLP engineers building psycholinguistic models: early-layer language model representations predict early-pass reading measures (first fixation, gaze duration) better than scalar predictors, while traditional surprisal remains the stronger predictor of late-pass measures such as total reading time. Consider combining both predictor types, and tune the choice of layer and predictor for the target language and eye-tracking measure rather than assuming one configuration transfers.

Key insights

Early language model layers predict early-pass reading measures better than scalar surprisal, whereas surprisal is the stronger predictor of late-pass measures.

Principles

Model depth aligns with temporal stages of human reading.
Scalar compression discards psychometrically relevant information.
Predictor effectiveness varies by language and eye-tracking measure.
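
To make the "scalar compression" principle concrete, here is a minimal sketch of how surprisal collapses a whole next-word distribution into one number. The two distributions and their words are made-up illustrative values, not data from the study; entropy is used only as one example of information that the scalar does not carry.

```python
import math

def surprisal_bits(probs, target):
    """Surprisal of the observed unit in bits: -log2 p(target | context)."""
    return -math.log2(probs[target])

def entropy_bits(probs):
    """Shannon entropy of the full next-word distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

# Hypothetical next-word distributions (illustrative numbers only).
# They agree on the probability of the observed word "cat" and nowhere else.
peaked = {"cat": 0.25, "dog": 0.70, "eel": 0.05}
flat = {"cat": 0.25, "dog": 0.25, "eel": 0.25, "fox": 0.25}

s_peaked = surprisal_bits(peaked, "cat")
s_flat = surprisal_bits(flat, "cat")
h_peaked = entropy_bits(peaked)
h_flat = entropy_bits(flat)

print(s_peaked, s_flat)   # identical surprisal for both contexts
print(h_peaked, h_flat)   # different uncertainty about the alternatives
```

Both contexts yield the same surprisal even though the model's state differs, which is the sense in which a vector-valued representation can retain information that a scalar predictor drops.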

Method

Regularized linear regression predicts unit-level reading times from LM representations and scalar baselines (surprisal, information value, logit-lens surprisal) across multiple layers and languages, using 10-fold cross-validation.
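
The procedure above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: it uses ridge (L2) regression as the regularizer, held-out mean squared error as the score, a fixed penalty `alpha`, and purely synthetic data. The names `ridge_fit` and `cv_mse` are hypothetical helpers.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression on centered data; returns (weights, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def cv_mse(X, y, alpha=1.0, k=10, seed=0):
    """Mean held-out squared error across k cross-validation folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w, b = ridge_fit(X[train_idx], y[train_idx], alpha)
        pred = X[test_idx] @ w + b
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errors))

# Synthetic stand-in data (assumption): the "reading times" are generated from
# the full vector, so the vector predictor wins here by construction.
rng = np.random.default_rng(0)
n_units, dim = 500, 16
hidden = rng.normal(size=(n_units, dim))            # stand-in for one layer's representations
true_w = rng.normal(size=dim)
reading_time = hidden @ true_w + rng.normal(scale=0.5, size=n_units)
scalar = (hidden[:, 0] + rng.normal(scale=0.1, size=n_units)).reshape(-1, 1)

mse_repr = cv_mse(hidden, reading_time)
mse_scalar = cv_mse(scalar, reading_time)
print(f"representation MSE: {mse_repr:.3f}, scalar MSE: {mse_scalar:.3f}")
```

The example demonstrates the comparison procedure only. With real data, the same loop would be repeated per layer, per language, and per eye-tracking measure, and the penalty strength would normally be selected on training folds rather than fixed.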

In practice

Use early LM layers to predict early-pass measures (first fixation, gaze duration).
Combine representations with surprisal for improved performance.
Consider language-specific model tuning for reading time tasks.
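
One simple way to combine the two predictor types is to stack them into a single design matrix before fitting a regularized regression. The sketch below assumes column-wise z-scoring and plain concatenation, which the source does not specify; `zscore` and `combine_predictors` are hypothetical helper names, and the input values are made up.

```python
import numpy as np

def zscore(X):
    """Standardize each column; constant columns are mapped to zero."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0
    return (X - X.mean(axis=0)) / std

def combine_predictors(surprisal, hidden):
    """Stack a scalar predictor and one layer's representation column-wise."""
    surprisal = np.asarray(surprisal, dtype=float).reshape(-1, 1)
    hidden = np.asarray(hidden, dtype=float)
    if surprisal.shape[0] != hidden.shape[0]:
        raise ValueError("surprisal and hidden must describe the same units")
    return np.hstack([zscore(surprisal), zscore(hidden)])

# Illustrative values: 4 units, a 3-dimensional stand-in representation.
surprisal = [2.0, 5.5, 1.2, 7.3]
hidden = np.arange(12, dtype=float).reshape(4, 3)

X = combine_predictors(surprisal, hidden)
print(X.shape)
```

Standardizing puts the scalar and the vector columns on a comparable scale so the penalty treats them evenly. In a cross-validated setup, the means and standard deviations would be computed on the training folds only and then applied to the held-out fold.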

Topics

Language Model Representations
Human Reading Times
Eye-tracking Corpora
Surprisal Theory
Logit Lens

Best for: NLP Engineer, AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.