When transformers learn "impossible" languages, what do they learn?

2026-06-29 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A recent study investigates why transformer language models exhibit a bias against "impossible" languages, that is, languages theorized to be unlearnable by humans. Where prior work focused on sample efficiency and test-set perplexity, this research directly evaluates two linguistic capacities: grammatical sensitivity and generative production. The authors trained GPT-2 style models on perturbed "impossible" variants of English and measured sensitivity to grammaticality with BLiMP minimal pairs. On that measure, performance degraded only gradually, with the degradation mediated primarily by the language's information locality. Generation was a different story: the models produced substantially fewer high-quality sentences, and the shortfall grew at longer sentence lengths. The findings suggest that generative deficiency and transmission failures are key factors in why language models struggle with unattested "impossible" languages.

Key takeaway

For NLP engineers building or evaluating language models: in this study, generative capacity rather than grammatical sensitivity was the bottleneck, and the failures grew with sequence length. Extend your evaluation beyond perplexity to direct measures of generative quality and of grammatical sensitivity, for example BLiMP minimal pairs. Improved handling of long-range dependencies is one plausible direction for mitigating these generative deficiencies.

Key insights

Transformers' struggle with "impossible" languages stems primarily from generative deficiency rather than from grammatical insensitivity.

Principles

Information locality mediates grammatical sensitivity degradation.
Generative production failures increase with sentence length.
Linguistic "impossibility" links to generative and transmission failures.
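
To make the notion of information locality concrete, here is a minimal sketch of how a word-order perturbation can pull apart words that were adjacent in the original sentence. The two perturbations (`reverse_tokens`, `seeded_shuffle`) and the distance measure (`mean_adjacent_distance`) are illustrative assumptions, not the perturbations or the locality measure used in the study.

```python
import random
from typing import Callable, List


def reverse_tokens(tokens: List[int]) -> List[int]:
    # Full reversal: originally adjacent tokens remain adjacent.
    return tokens[::-1]


def seeded_shuffle(tokens: List[int], seed: int = 0) -> List[int]:
    # Hypothetical perturbation: a reproducible random reordering.
    rng = random.Random(seed)
    out = list(tokens)
    rng.shuffle(out)
    return out


def mean_adjacent_distance(
    perturb: Callable[[List[int]], List[int]], n: int
) -> float:
    """Mean distance, after perturbation, between tokens that were
    adjacent in the original order. A crude proxy for lost locality."""
    if n < 2:
        raise ValueError("need at least two tokens")
    order = perturb(list(range(n)))  # order[j] = original index now at position j
    pos = {orig: new for new, orig in enumerate(order)}
    return sum(abs(pos[i + 1] - pos[i]) for i in range(n - 1)) / (n - 1)
```

Under this proxy, reversal scores exactly 1.0 (locality preserved), while a shuffle generally scores higher because neighbouring words end up scattered.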

Method

GPT-2 style models were trained on perturbed "impossible" English variants. Grammatical sensitivity was measured using BLiMP minimal pairs, and generative production was evaluated for high-quality sentence output at varying lengths.
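
The minimal-pair measurement can be sketched as follows: a pair counts as correct when the model assigns a higher score to the grammatical sentence than to its ungrammatical twin. In this sketch, `sentence_logprob` is a hypothetical stand-in for the trained model's summed token log-probabilities, and `toy_logprob` is a toy scorer included only so the example runs without a model; neither comes from the study.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    sentence_logprob: Callable[[str], float],
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs in which the scorer
    strictly prefers the grammatical sentence."""
    total = 0
    correct = 0
    for good, bad in pairs:
        total += 1
        if sentence_logprob(good) > sentence_logprob(bad):
            correct += 1
    if total == 0:
        raise ValueError("no minimal pairs supplied")
    return correct / total


def toy_logprob(sentence: str) -> float:
    # Toy scorer (assumption): penalises only the bigram "cats is".
    words = sentence.lower().split()
    errors = sum(1 for a, b in zip(words, words[1:]) if (a, b) == ("cats", "is"))
    return -float(len(words)) - 5.0 * errors


example_pairs = [
    ("the cats are here", "the cats is here"),  # toy scorer catches this one
    ("the dog is here", "the dog are here"),    # toy scorer misses this one
]
accuracy = minimal_pair_accuracy(example_pairs, toy_logprob)
```

In a real evaluation the toy scorer would be replaced by the language model's own sentence score, and the pairs would come from the BLiMP benchmark.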

In practice

Evaluate generative quality at longer lengths.
Account for information locality when designing perturbed or synthetic languages.
Test models with BLiMP minimal pairs.
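
The first recommendation above can be sketched as a length-bucketed quality report: group generated sentences by word count and report the share judged acceptable in each bucket. The judge `is_acceptable` and the default `bucket_size` are assumptions for illustration; the source does not specify how sentence quality was scored.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def acceptable_rate_by_length(
    sentences: Iterable[str],
    is_acceptable: Callable[[str], bool],
    bucket_size: int = 5,
) -> Dict[Tuple[int, int], float]:
    """Map each (min_len, max_len) word-count bucket to the fraction of
    sentences in that bucket that the judge accepts."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [accepted, total]
    for sentence in sentences:
        n = len(sentence.split())
        if n == 0:
            continue  # skip empty generations
        start = ((n - 1) // bucket_size) * bucket_size + 1
        bucket = (start, start + bucket_size - 1)
        counts[bucket][1] += 1
        if is_acceptable(sentence):
            counts[bucket][0] += 1
    return {b: ok / total for b, (ok, total) in sorted(counts.items())}
```

A rate that falls from one bucket to the next is the pattern the study describes: fewer high-quality sentences as length increases.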

Topics

Transformer Models
Language Acquisition
Grammatical Sensitivity
Generative Models
NLP Evaluation
Linguistic Theory

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.