Understanding the Parameter Space Geometry of Transformers Encoding Boolean Functions
Summary
Transformers consistently fail to learn certain simple functions, such as PARITY, even though they are expressive enough to represent them. The failure is most evident for sensitive functions, whose output changes when a single input bit is flipped. A bias towards functions of low average sensitivity had already been noted, but the underlying mechanism was unclear. New research shows that sensitive functions, even when representable, occupy a vanishingly small region of the transformer's parameter space, so random initialization is highly unlikely to land on them. The study shifts focus from average sensitivity to the full sensitivity profile, that is, the sensitivity at each individual input string. It proves that randomly initialized transformers almost surely compute functions that have low-sensitivity strings, rendering any function lacking such strings provably unlearnable.
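To make the distinction between average sensitivity and the full sensitivity profile concrete, here is a minimal Python sketch. It uses the standard definition of sensitivity at an input (the number of single-bit flips that change the output). The helper names `sensitivity_at` and `sensitivity_profile` and the choice of MAJORITY as a contrast function are illustrative assumptions, not taken from the original article.

```python
from itertools import product


def sensitivity_at(f, x):
    """Count the single-bit flips of x that change the output of f."""
    count = 0
    for i in range(len(x)):
        flipped = x[:i] + (1 - x[i],) + x[i + 1:]
        if f(flipped) != f(x):
            count += 1
    return count


def sensitivity_profile(f, n):
    """Sensitivity of f at every input string of length n (hypothetical helper)."""
    return [sensitivity_at(f, x) for x in product((0, 1), repeat=n)]


def parity(x):
    return sum(x) % 2


def majority(x):
    # Illustrative contrast function: 1 if more than half the bits are 1.
    return int(sum(x) > len(x) / 2)


n = 5
parity_profile = sensitivity_profile(parity, n)
majority_profile = sensitivity_profile(majority, n)

# PARITY is maximally sensitive everywhere: it has no low-sensitivity string.
print("parity   min:", min(parity_profile),
      "avg:", sum(parity_profile) / len(parity_profile))

# MAJORITY has many strings where no single flip changes the output.
print("majority min:", min(majority_profile),
      "avg:", sum(majority_profile) / len(majority_profile))
```

For n = 5, every entry of the PARITY profile equals 5, whereas MAJORITY has strings with sensitivity 0 (for example, the all-zeros string). In the terms of the summary above, PARITY is an example of a function that lacks low-sensitivity strings.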
Key takeaway
If you develop or train transformer models, recognize that functions that are highly sensitive on every input are inherently difficult to learn because of the geometry of the parameter space. Your models will likely default to low-sensitivity behavior, so representing sensitive functions effectively may require targeted architectural changes or initialization strategies. Keep this limitation in mind when designing models for tasks that involve complex logical operations or high input sensitivity.
Key insights
Transformers struggle with sensitive functions because their parameter space geometry makes these functions nearly impossible to find via random initialization.
Principles
- Transformers exhibit a bias toward low average sensitivity functions.
- Sensitive functions occupy a vanishingly small region of parameter space.
- Random initialization almost surely misses sensitive function regions.
Topics
- Transformers
- Parameter Space Geometry
- Boolean Functions
- Function Learnability
- Model Initialization
- Sensitivity Bias
Best for: Research Scientist, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.