Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew
Summary
Giuseppe Samo and Paola Merlo investigated how Transformer models represent complex verb paradigms in Turkish and Modern Hebrew, focusing on the impact of tokenization strategies. Using the Blackbird Language Matrices task on natural data, their study revealed that for Turkish, which has transparent morphological markers, both monolingual and multilingual models performed well with either atomic or subword tokenization. However, for Hebrew, which features non-concatenative morphology, monolingual and multilingual models diverged significantly. A multilingual model employing character-level tokenization failed to capture Hebrew's morphology, whereas a monolingual model with morpheme-aware segmentation achieved strong performance. The researchers also noted that performance across all models improved when evaluated on more synthetic datasets.
Key takeaway
Research scientists developing or fine-tuning language models for morphologically rich languages, especially those with non-concatenative structures like Hebrew, should prioritize morpheme-aware tokenization. In this study, a multilingual model with character-level tokenization failed to capture Hebrew verb morphology, so relying on character-level or generic subword tokenization for such languages risks poor performance on tasks involving complex verb paradigms.
Key insights
Tokenization strategy critically impacts Transformer models' ability to represent complex morphology, especially in non-concatenative languages.
Principles
- Transparent morphology (as in Turkish) is robust to the choice of tokenization.
- Non-concatenative morphology requires morpheme-aware tokenization.
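To make the contrast between the two principles concrete, here is a minimal sketch of concatenative versus root-and-pattern word formation. It is an illustration only, not the authors' code: the helper names (`concatenate`, `interleave`, `is_contiguous`) are hypothetical, the Hebrew forms are romanized for readability, and the `C` placeholder notation for templates is an assumption of this sketch.

```python
# Toy illustration (assumed helpers, romanized Hebrew); not from the paper.

def concatenate(stem, suffixes):
    """Concatenative morphology: morphemes are appended one after another."""
    return stem + "".join(suffixes)

def interleave(root, pattern):
    """Root-and-pattern morphology: each 'C' slot in the pattern is filled
    by the next consonant of the root, so root and pattern are interleaved."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

def is_contiguous(morpheme, word):
    """True if the morpheme appears as one unbroken substring of the word."""
    return morpheme in word

# Turkish: gel (come) + iyor (progressive) + um (1st person singular)
turkish = concatenate("gel", ["iyor", "um"])

# Hebrew (romanized): root k-t-v (write) in two verbal templates
hebrew_past = interleave("ktv", "CaCaC")
hebrew_present = interleave("ktv", "CoCeC")

print(turkish, [is_contiguous(m, turkish) for m in ("gel", "iyor", "um")])
print(hebrew_past, hebrew_present, is_contiguous("ktv", hebrew_past))
```

Every Turkish morpheme survives as a contiguous substring, so many segmentations can still expose it. The Hebrew root never appears as a contiguous substring, which is one plausible reason a tokenizer that only cuts the surface string struggles with it.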
Method
The authors evaluated monolingual and multilingual Transformer models on Turkish and Hebrew verb paradigms using the Blackbird Language Matrices task on natural data, comparing atomic, subword, character-level, and morpheme-aware tokenization.
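The four tokenization granularities named above can be sketched on toy input as follows. This is a hedged illustration, not the tokenizers used in the study (which this summary does not describe): the subword vocabulary, the greedy longest-match rule, and the `MORPH_LEXICON` lookup table are all assumptions made for the example, and Hebrew is romanized.

```python
# Toy tokenizers under stated assumptions; not the models' real tokenizers.

def tokenize_atomic(word):
    """Atomic: the whole word is a single token."""
    return [word]

def tokenize_chars(word):
    """Character-level: one token per character."""
    return list(word)

def tokenize_subword(word, vocab):
    """Greedy longest-match subword segmentation over a given vocabulary,
    falling back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])
            i += 1
    return tokens

# Hypothetical analyzer output: surface form -> [root, template]
MORPH_LEXICON = {"katav": ["ktv", "CaCaC"], "kotev": ["ktv", "CoCeC"]}

def tokenize_morpheme_aware(word, lexicon):
    """Morpheme-aware: return the analyzed morphemes if known, else the word."""
    return lexicon.get(word, [word])

turkish_tokens = tokenize_subword("geliyorum", {"gel", "iyor", "um"})
hebrew_subword = tokenize_subword("katav", {"kat", "av"})
hebrew_chars = tokenize_chars("katav")
hebrew_morph = tokenize_morpheme_aware("katav", MORPH_LEXICON)

print(turkish_tokens, hebrew_subword, hebrew_chars, hebrew_morph)
```

In this toy setting the Turkish subword pieces coincide with morphemes, while the Hebrew subword and character pieces cut across the root; only the morpheme-aware lookup keeps the root `ktv` as a unit shared between `katav` and `kotev`.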
In practice
- Prioritize morpheme-aware tokenization for Hebrew.
- Expect higher scores on more synthetic datasets; treat results on natural data as the harder test.
Topics
- Transformer Models
- Tokenization Strategies
- Morphological Analysis
- Turkish Language Processing
- Hebrew Language Processing
Best for: Research Scientist, AI Researcher, NLP Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.