When transformers learn "impossible" languages, what do they learn?
Summary
A recent study investigates why transformer language models exhibit a bias against "impossible" languages, which are theorized to be unacquirable by humans. Prior work focused on sample efficiency and test-set perplexity; this research instead directly evaluates two linguistic capacities: grammatical sensitivity and generative production. The authors trained GPT-2 style models on perturbed "impossible" variants of English and measured sensitivity to grammaticality with BLiMP minimal pairs. Performance on these pairs degraded only gradually, with the degradation mediated primarily by the language's information locality. In generation, however, the models failed markedly, producing substantially fewer high-quality sentences as sentence length increased. These findings suggest that generative deficiency and transmission failures are key factors in explaining why language models struggle with non-attested "impossible" languages.
Key takeaway
NLP engineers developing or evaluating language models should treat generative capacity as a critical bottleneck for complex linguistic structures, especially at longer sequence lengths. Evaluation should extend beyond perplexity to include direct assessments of generative quality and of grammatical sensitivity, using tools like BLiMP. Prioritize architectural improvements that enhance long-range dependency handling to mitigate these generative deficiencies.
Key insights
Transformers' struggle with "impossible" languages stems from generative deficiency, not primarily grammatical insensitivity.
Principles
- Information locality mediates grammatical sensitivity degradation.
- Generative production failures increase with sentence length.
- Linguistic "impossibility" is linked to generative and transmission failures.
Method
GPT-2 style models were trained on perturbed "impossible" English variants. Grammatical sensitivity was measured using BLiMP minimal pairs, and generative production was evaluated for high-quality sentence output at varying lengths.
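The minimal-pair measurement can be illustrated with a small sketch. BLiMP pairs a grammatical sentence with a minimally different ungrammatical one, and a model counts as correct when it assigns the grammatical sentence the higher probability. The code below is not the authors' setup: it assumes a toy add-one smoothed bigram model as a stand-in for a GPT-2 style model, and the corpus, the pairs, and the function names are all hypothetical.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability (toy stand-in for an LM score)."""
    vocab_size = len(unigrams) + 1  # +1 reserves mass for unseen tokens
    tokens = ["<s>"] + sentence.split()
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )

def minimal_pair_accuracy(pairs, score):
    """Fraction of (good, bad) pairs where the good sentence scores higher."""
    hits = sum(score(good) > score(bad) for good, bad in pairs)
    return hits / len(pairs)

# Hypothetical training corpus and agreement-style minimal pairs.
corpus = ["the dog barks", "the dogs bark", "the cat sleeps", "the cats sleep"]
unigrams, bigrams = train_bigram(corpus)
pairs = [
    ("the dog barks", "the dog bark"),
    ("the cats sleep", "the cats sleeps"),
]
accuracy = minimal_pair_accuracy(pairs, lambda s: log_prob(s, unigrams, bigrams))
print(f"minimal-pair accuracy: {accuracy:.2f}")
```

Because `minimal_pair_accuracy` only needs a callable that returns a sentence score, a real language model's summed token log-probabilities could be passed in place of the bigram scorer.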
In practice
- Evaluate generative quality at longer lengths.
- Consider information locality in language design.
- Test models with BLiMP minimal pairs.
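One way to act on the first point is to bucket generated sentences by length and track the share judged acceptable in each bucket. The sketch below is illustrative only: the summary does not say how the study judged sentence quality, so `toy_judge`, the sample sentences, and the bucket size are hypothetical placeholders for a real grammaticality or quality check.

```python
from collections import defaultdict

def quality_by_length(sentences, is_acceptable, bucket_size=5):
    """Map each length bucket's lower bound to the share of acceptable sentences."""
    totals, accepted = defaultdict(int), defaultdict(int)
    for sentence in sentences:
        n_tokens = len(sentence.split())
        bucket = (n_tokens // bucket_size) * bucket_size
        totals[bucket] += 1
        accepted[bucket] += bool(is_acceptable(sentence))
    return {b: accepted[b] / totals[b] for b in sorted(totals)}

def toy_judge(sentence):
    """Hypothetical judge: accepts sentences that end with a period token."""
    return sentence.endswith(".")

# Hypothetical model generations of varying length.
samples = [
    "the dog barks .",
    "the dog",
    "the cat sat on the mat today .",
    "the cat sat on the mat and",
]
rates = quality_by_length(samples, toy_judge)
for bucket, rate in rates.items():
    print(f"lengths {bucket}-{bucket + 4}: {rate:.0%} acceptable")
```

A falling acceptance rate across buckets would be the kind of length-dependent generative failure the study reports, which perplexity alone would not reveal.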
Topics
- Transformer Models
- Language Acquisition
- Grammatical Sensitivity
- Generative Models
- NLP Evaluation
- Linguistic Theory
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.