World Properties without World Models: Recovering Spatial and Temporal Structure from Co-occurrence Statistics in Static Word Embeddings

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A recent study shows that static co-occurrence-based word embeddings, such as GloVe and Word2Vec, encode substantial spatial and temporal structure. This challenges the view that such structure is evidence of "world-like" internal representations unique to large language models. Using ridge regression probes, the study recovered geographic signal with held-out R^2 values of 0.71-0.87 for city coordinates, and a weaker but reliable temporal signal with R^2 values of 0.48-0.52 for the birth years of historical figures. Semantic-neighbor analyses showed that these signals depend strongly on interpretable lexical gradients, particularly country names and climate-related vocabulary. The findings suggest that ordinary word co-occurrence preserves richer spatial, temporal, and environmental structure than previously assumed, so linear probe recoverability alone does not establish that a model's representations go beyond text statistics.
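The probing setup described above can be illustrated with a minimal sketch. It assumes synthetic data: random vectors stand in for word embeddings and a noisy linear function of them stands in for city coordinates, so the printed scores say nothing about the study's actual results. The function names (`fit_ridge`, `held_out_r2`, `run_probe_demo`) and all parameter values are hypothetical choices for illustration, not the authors' code.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    d = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) W = Xc^T Yc for the probe weights.
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(d), Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b


def held_out_r2(Y_true, Y_pred):
    """R^2 per target column on held-out data, averaged across columns."""
    ss_res = ((Y_true - Y_pred) ** 2).sum(axis=0)
    ss_tot = ((Y_true - Y_true.mean(axis=0)) ** 2).sum(axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))


def run_probe_demo(seed=0, n=500, d=50, noise_std=5.0, alpha=10.0,
                   shuffle_targets=False):
    """Fit a linear probe on synthetic 'embeddings' and score it on a held-out split."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))            # stand-in for word vectors (assumption)
    W_true = rng.standard_normal((d, 2))       # hidden linear map to two "coordinates"
    Y = X @ W_true + noise_std * rng.standard_normal((n, 2))
    if shuffle_targets:
        # Control: break the embedding-target pairing to see what chance looks like.
        Y = Y[rng.permutation(n)]
    n_train = int(0.8 * n)
    W, b = fit_ridge(X[:n_train], Y[:n_train], alpha)
    return held_out_r2(Y[n_train:], X[n_train:] @ W + b)


probe_r2 = run_probe_demo()
control_r2 = run_probe_demo(shuffle_targets=True)
print(f"held-out R^2 (probe):   {probe_r2:.2f}")
print(f"held-out R^2 (control): {control_r2:.2f}")
```

The shuffled-target control is included because a held-out R^2 is only informative relative to a baseline: a probe that scores well on real pairings but near or below zero on shuffled ones is picking up structure in the vectors rather than fitting noise.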

Key takeaway

Static co-occurrence embeddings (GloVe, Word2Vec) capture significant spatial and temporal "world properties" from text alone. Under ridge regression probes, they reach held-out R^2 values of 0.71-0.87 for city coordinates and 0.48-0.52 for historical birth years, with the signal driven by interpretable lexical gradients such as country names and climate vocabulary. Linear probe recoverability in LLMs is therefore not, by itself, evidence of a "world model": much of the same structure is already present in basic textual co-occurrence.

Topics

Best for: AI Scientist, Research Scientist, AI Researcher, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.