Structure Before Collapse: Transient semantic geometry in next-token prediction
Summary
The paper "Structure Before Collapse: Transient semantic geometry in next-token prediction" by Yize Zhao, Isabel Papadimitriou, and Christos Thrampoulidis investigates an apparent paradox in language models. These models are trained predominantly with one-hot labels, a setting in which Neural Collapse theory predicts symmetric, semantically undifferentiated representations, yet they clearly learn latent semantic structure. Using three controlled synthetic settings, the authors show that semantic geometry emerges early in training: representations cluster by shared attributes even without explicit supervision. This emergent structure is transient; given sufficient capacity and training time, the model eventually converges to the predicted symmetric state. The study uses Gram matrix analysis to examine this phase transition and proposes a preliminary modification to the commonly used unconstrained features model to better capture the emergent semantic geometry.
Key takeaway
If you are an AI scientist optimizing language model training, note that valuable semantic structure emerges early but is transient. Your models may initially learn rich semantic geometry, only for it to collapse into symmetric, less semantically useful representations under prolonged training. Consider strategies to capture or stabilize this emergent geometry, such as adjusting training duration, adding regularization, or exploring architectural modifications that preserve these latent features.
Key insights
Semantic structure in next-token prediction LMs emerges transiently despite one-hot training, before collapsing to symmetric representations.
Principles
- Neural Collapse predicts symmetric representations in one-hot classification.
- Semantic geometry can emerge without explicit supervision.
- Early training phases can exhibit transient, structured representations.
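For reference, the symmetric geometry usually associated with Neural Collapse in balanced K-class one-hot classification is the simplex equiangular tight frame (ETF). The expression below gives the Gram matrix of the unit-norm, globally centered class representations in that standard setting. The symbols K and G are introduced here for illustration and are not taken from the paper, whose exact formulation may differ.

```latex
G^{\mathrm{ETF}}
  = \frac{K}{K-1}\left( I_K - \frac{1}{K}\,\mathbf{1}_K \mathbf{1}_K^{\top} \right),
\qquad
G^{\mathrm{ETF}}_{ij} =
  \begin{cases}
    1 & i = j, \\[2pt]
    -\dfrac{1}{K-1} & i \neq j .
  \end{cases}
```

Every pair of distinct classes has the same cosine similarity, so this Gram matrix carries no information about which classes share attributes. That is the sense in which the collapsed state is semantically undifferentiated.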
Method
The authors investigate how semantic structure emerges using three controlled synthetic settings and Gram matrix analysis. They also propose a preliminary modification to the unconstrained features model.
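A Gram matrix analysis of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes one representation vector per class (for example, a class mean), centers and normalizes them, and measures the relative distance from the simplex ETF Gram matrix that standard Neural Collapse predicts. The function names `gram_matrix`, `simplex_etf_gram`, and `distance_to_etf` are hypothetical.

```python
import numpy as np


def gram_matrix(H):
    """Cosine-similarity Gram matrix of globally centered row vectors.

    H has shape (k, d): one representation per class (an assumption here).
    """
    H = np.asarray(H, dtype=float)
    H = H - H.mean(axis=0, keepdims=True)             # remove the global mean
    H = H / np.linalg.norm(H, axis=1, keepdims=True)  # unit-normalize each row
    return H @ H.T


def simplex_etf_gram(k):
    """Gram matrix of a k-class simplex ETF: 1 on the diagonal, -1/(k-1) elsewhere."""
    return (k / (k - 1)) * (np.eye(k) - np.ones((k, k)) / k)


def distance_to_etf(H):
    """Relative Frobenius distance between the Gram matrix of H and the ETF Gram matrix."""
    G = gram_matrix(H)
    target = simplex_etf_gram(G.shape[0])
    return np.linalg.norm(G - target) / np.linalg.norm(target)


if __name__ == "__main__":
    collapsed = np.eye(4)                        # one-hot rows center to a simplex ETF
    clustered = np.array([[1.0, 0.0], [1.0, 0.1],
                          [-1.0, 0.0], [-1.0, 0.1]])  # two tight attribute clusters
    print("collapsed:", distance_to_etf(collapsed))
    print("clustered:", distance_to_etf(clustered))
```

A distance near zero indicates the symmetric collapsed geometry, while a large distance indicates that some pairs of classes are noticeably closer than others. Tracking this quantity over training steps is one way to see when the structured phase gives way to collapse.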
In practice
- Analyze representation dynamics over training using Gram matrix analysis.
- Inspect early training checkpoints, where the transient semantic structure appears.
- Explore model modifications to preserve emergent geometry.
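The first two points above could be combined roughly as in the sketch below, which scores each saved checkpoint by how much more similar same-attribute representations are than different-attribute ones. This is an illustrative heuristic rather than a metric from the paper. The names `attribute_gap` and `most_structured_checkpoint`, the `checkpoints` dictionary (training step mapped to a representation matrix), and the attribute labels are all hypothetical.

```python
import numpy as np


def attribute_gap(H, attrs):
    """Mean cosine similarity of same-attribute pairs minus different-attribute pairs.

    H: (n, d) array, one representation per item.
    attrs: length-n attribute labels; each attribute needs at least two items.
    """
    H = np.asarray(H, dtype=float)
    attrs = np.asarray(attrs)
    H = H - H.mean(axis=0, keepdims=True)             # remove the global mean
    H = H / np.linalg.norm(H, axis=1, keepdims=True)  # unit-normalize each row
    G = H @ H.T                                       # cosine-similarity Gram matrix
    same = attrs[:, None] == attrs[None, :]
    off_diag = ~np.eye(len(attrs), dtype=bool)        # ignore self-similarity
    return G[same & off_diag].mean() - G[~same].mean()


def most_structured_checkpoint(checkpoints, attrs):
    """Return (step, gap) for the checkpoint with the largest attribute gap."""
    gaps = {step: attribute_gap(H, attrs) for step, H in checkpoints.items()}
    best = max(gaps, key=gaps.get)
    return best, gaps[best]


if __name__ == "__main__":
    attrs = [0, 0, 1, 1]
    checkpoints = {
        100: np.array([[1.0, 0.0], [1.0, 0.1],
                       [-1.0, 0.0], [-1.0, 0.1]]),  # toy "early" clustered geometry
        10000: np.eye(4),                           # toy "late" symmetric geometry
    }
    print(most_structured_checkpoint(checkpoints, attrs))
```

A gap near zero means attributes are no longer distinguishable from the Gram matrix alone, as in the symmetric collapsed state. A clearly positive gap suggests that the checkpoint still carries attribute structure.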
Topics
- Next-token prediction
- Neural Collapse
- Semantic Geometry
- Language Models
- Representation Learning
- Gradient Descent Dynamics
Best for: Research Scientist, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.