Linguistic Issues in Computational Linguistics for Non-English Languages
Summary
Current computational linguistics and multilingual large language model (LLM) ecosystems are structurally English-centric in their training data, tokenization, representation, and evaluation. The result is systematic weakness for non-English languages at every linguistic level: morphology, lexis, syntax, semantics, and discourse. Morphologically, languages written in non-Latin scripts are often over-tokenized, which calls for morphology-aware tokenization and attention to hapaxes (words that occur only once in a corpus); finite-state techniques for Arabic are one example. Lexically, corpus-driven research and tools like Sketch Engine supply empirical evidence, which matters most in high-stakes domains. Syntactically, long-distance dependencies are a particular challenge in non-English languages, which makes frameworks like Universal Dependencies useful. Semantically, LLMs rely on statistical patterns and self-attention and often route reasoning through English in the latent space, which degrades performance and fluency in other languages. The bias extends to discourse understanding, where relevant datasets for anaphora and entity coherence are needed, and to spoken-language and multimodal datasets: only 5.7% of music datasets are non-Western.
Key takeaway
Machine learning engineers building multilingual LLM applications should recognize that current models often route reasoning through English, which hurts non-English performance. Prioritize morphology-aware tokenization for low-resource languages, and integrate corpus-driven tools for empirical validation in high-stakes domains. Consider deliberately using English in intermediate inference steps if it improves target-language output, and invest in diverse non-English discourse datasets to mitigate these biases.
Key insights
English-centric LLM design creates systematic linguistic weaknesses for non-English languages across multiple levels.
Principles
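- LLMs make key linguistic decisions in a representation space closest to English.
- Corpus-driven research grounds lexical claims in empirical evidence, which strengthens robustness.
- Statistical, language-agnostic tokenization underperforms on morphologically rich languages and non-Latin scripts.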
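One contributor to over-tokenization can be seen without any tokenizer at all. The sketch below is an illustration rather than the article's method: it measures how many UTF-8 bytes each character costs, which is the starting point for byte-level tokenizers. The function name `utf8_bytes_per_char` and the sample words are assumptions chosen for the example.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


# Illustrative sample words (assumed for this sketch), written as escapes
# so the file stays ASCII-safe.
SAMPLES = {
    "English": "language",                   # Latin script
    "Arabic": "\u0644\u063a\u0629",          # Arabic script
    "Hindi": "\u092d\u093e\u0937\u093e",     # Devanagari script
}

if __name__ == "__main__":
    for name, word in SAMPLES.items():
        print(f"{name}: {utf8_bytes_per_char(word):.1f} bytes per character")
```

A byte-level vocabulary learned mostly from English text has few merges for the multi-byte sequences, so the same amount of meaning tends to cost more tokens in Arabic or Hindi.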
Method
For non-English languages, use morphology-aware tokenization, account for hapaxes, and apply finite-state techniques to languages with complex morphology, such as Arabic.
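As a minimal sketch of the hapax step only, the snippet below counts words that occur exactly once in a text and reports their share of the vocabulary. The function name `hapax_stats`, the regex-based word splitting, and the Turkish sample are assumptions for illustration; a real pipeline would use a language-appropriate tokenizer.

```python
import re
from collections import Counter


def hapax_stats(text: str):
    """Return (sorted hapaxes, hapax share of the vocabulary) for `text`.

    Words are split with a simple Unicode-aware regex, which is an
    assumption of this sketch and not a morphology-aware tokenizer.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    ratio = len(hapaxes) / len(counts) if counts else 0.0
    return hapaxes, ratio


if __name__ == "__main__":
    # Inflected forms of Turkish "ev" (house); each surface form is distinct.
    sample = "ev evler evlerde ev evlerimizde"
    words, share = hapax_stats(sample)
    print(words, share)
```

In a morphologically rich language, many inflected forms of one lemma appear only once, so a high hapax share is a hint that word-level or purely statistical vocabularies will cover the language poorly.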
In practice
- Use morphology-aware tokenization for low-resource languages.
- Employ corpus linguistics tools like Sketch Engine in high-stakes domains such as legal ML.
- Apply Universal Dependencies for non-English syntax parsing.
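To make the Universal Dependencies item concrete, here is a small sketch that reads one sentence in the CoNLL-U format used by Universal Dependencies treebanks and computes how far each word sits from its syntactic head. The function name `dependency_distances` and the hand-annotated German sentence are assumptions for illustration, not data from a treebank.

```python
def dependency_distances(conllu_sentence: str) -> list:
    """Return (form, |id - head|) for every non-root word in a CoNLL-U sentence."""
    distances = []
    for line in conllu_sentence.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        token_id, form, head = cols[0], cols[1], cols[6]
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if not token_id.isdigit() or not head.isdigit():
            continue
        if head == "0":
            continue  # the root has no head inside the sentence
        distances.append((form, abs(int(token_id) - int(head))))
    return distances


# Hand-written annotation (assumed) of "Der Hund hat den Ball gefangen".
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
ROWS = [
    ("1", "Der", "der", "DET", "_", "_", "2", "det", "_", "_"),
    ("2", "Hund", "Hund", "NOUN", "_", "_", "6", "nsubj", "_", "_"),
    ("3", "hat", "haben", "AUX", "_", "_", "6", "aux", "_", "_"),
    ("4", "den", "der", "DET", "_", "_", "5", "det", "_", "_"),
    ("5", "Ball", "Ball", "NOUN", "_", "_", "6", "obj", "_", "_"),
    ("6", "gefangen", "fangen", "VERB", "_", "_", "0", "root", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)

if __name__ == "__main__":
    for form, dist in dependency_distances(SAMPLE):
        print(form, dist)
```

Because the German participle comes last, the subject "Hund" sits four positions from its head, the kind of long-distance dependency that a shared annotation scheme makes measurable across languages.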
Topics
- Computational Linguistics
- Multilingual LLMs
- Non-English Languages
- English-centric Bias
- Morphology-aware Tokenization
- Universal Dependencies
- Corpus Linguistics
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.