What exactly does word2vec learn?
Summary
A new paper provides a quantitative and predictive theory for how `word2vec` learns dense vector representations of words, a process previously lacking a comprehensive theoretical description. The research demonstrates that, under realistic conditions, `word2vec`'s learning problem simplifies to unweighted least-squares matrix factorization. The theory, which solves gradient flow dynamics in closed form, reveals that the final learned representations are equivalent to Principal Component Analysis (PCA) of a specific target matrix, $M^{\star}_{ij} = \frac{P(i,j) - P(i)P(j)}{\frac{1}{2}(P(i,j) + P(i)P(j))}$. This matrix is defined by word co-occurrence and unigram probabilities. The study shows `word2vec` learns in discrete, sequential steps, incrementing the rank of the embedding matrix by acquiring one "concept" or orthogonal linear subspace at a time, with each concept corresponding to an interpretable topic. The theoretical model achieves 66% accuracy on analogy completion, closely matching `word2vec`'s 68%.
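To make the definition of $M^{\star}$ concrete, here is a minimal NumPy sketch that builds it from a matrix of co-occurrence counts. The function name `target_matrix`, the toy counts, and the choice to estimate $P(i)$ as the marginal of $P(i,j)$ are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def target_matrix(cooc_counts):
    """Return M* for a symmetric matrix of word co-occurrence counts."""
    counts = np.asarray(cooc_counts, dtype=float)
    p_joint = counts / counts.sum()        # P(i, j)
    p_word = p_joint.sum(axis=1)           # P(i); assumed here to be the marginal of P(i, j)
    p_indep = np.outer(p_word, p_word)     # P(i) P(j)
    return (p_joint - p_indep) / (0.5 * (p_joint + p_indep))

# Toy 3-word statistics (made-up numbers, for illustration only).
toy_counts = [[10, 8, 1],
              [8, 12, 2],
              [1, 2, 6]]
M_star = target_matrix(toy_counts)
```

Each entry is positive when two words co-occur more often than independence would predict, zero when they are independent, and bounded between -2 and 2.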
Key takeaway
For research scientists developing or analyzing neural language models, understanding `word2vec`'s learning dynamics as PCA on a specific corpus-derived matrix provides a foundational insight. This theory offers a closed-form solution for feature learning, enabling a priori prediction of learned concepts and potentially guiding the design of more interpretable and controllable representation learning algorithms. You should consider how similar matrix factorization principles might apply to more complex LLMs.
Key insights
Under realistic conditions, the representations `word2vec` learns are equivalent to PCA on a specific corpus-derived matrix.
Principles
- `word2vec` learns concepts sequentially.
- Learned features are the top eigenvectors of $M^{\star}$.
Method
The learning process reduces to unweighted least-squares matrix factorization, with gradient flow dynamics solvable in closed form, yielding PCA as the final representation.
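The stepwise behaviour can be illustrated with a small simulation. The sketch below runs plain gradient descent on $\frac{1}{4}\lVert M - WW^{\top}\rVert_F^2$ from a tiny initialization and records when each eigen-direction of $M$ is half learned. The symmetric tied factorization $WW^{\top}$, the synthetic target `M`, the eigenvalues 4, 2, 1, and all hyperparameters are assumptions chosen for illustration; this is not the paper's code or its closed-form solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical symmetric target with three "concepts" of strength 4, 2, 1.
n, d = 6, 3
eigvals = np.array([4.0, 2.0, 1.0, 0.0, 0.0, 0.0])
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthonormal eigenvectors
M = Q @ np.diag(eigvals) @ Q.T

W = 1e-6 * rng.standard_normal((n, d))   # tiny initialization
lr, steps = 0.01, 4000
crossing = [None] * d                    # step at which each concept is half learned

for step in range(steps):
    W += lr * (M - W @ W.T) @ W          # gradient step on 0.25 * ||M - W W^T||_F^2
    strength = ((Q[:, :d].T @ W) ** 2).sum(axis=1)   # q_k^T (W W^T) q_k for each concept
    for k in range(d):
        if crossing[k] is None and strength[k] > 0.5 * eigvals[k]:
            crossing[k] = step

print(crossing)
```

The recorded steps come out in order of decreasing eigenvalue: the strongest direction is picked up first, then the next, so the rank of $WW^{\top}$ grows one step at a time.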
In practice
- Inspect $M^{\star}$ eigenvectors to predict learned features.
- Use PCA on $M^{\star}$ to approximate `word2vec` embeddings; the theoretical model reaches 66% analogy accuracy versus 68% for `word2vec`.
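Both steps above can be sketched with a single eigendecomposition. The helper `pca_embeddings`, the stand-in matrix `M_toy`, and the $\sqrt{\lambda}$ scaling of the eigenvectors are assumptions made for illustration; in practice `M_toy` would be replaced by $M^{\star}$ estimated from a corpus.

```python
import numpy as np

def pca_embeddings(M, d):
    """Rank-d embeddings from the top-d eigenpairs of a symmetric matrix M."""
    M = np.asarray(M, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(M)        # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]         # indices of the d largest eigenvalues
    lam = np.clip(eigvals[top], 0.0, None)      # negative modes cannot be fit by W W^T
    return eigvecs[:, top] * np.sqrt(lam), eigvals[top]

# Hypothetical stand-in for M* (any symmetric matrix works here).
M_toy = np.array([[ 2.0,  1.0, -1.0],
                  [ 1.0,  2.0, -1.0],
                  [-1.0, -1.0,  2.0]])
emb, lam = pca_embeddings(M_toy, d=2)
```

Each row of `emb` is one word's vector, and each column corresponds to one eigenvector of the input matrix, which is where a predicted feature would be read off.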
Topics
- word2vec
- Representation Learning
- Matrix Factorization
- PCA
- Language Models
Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Berkeley Artificial Intelligence Research Blog.