Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew

2026-02-05 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Advanced, medium

Summary

Giuseppe Samo and Paola Merlo investigated how Transformer models represent complex verb paradigms in Turkish and Modern Hebrew, focusing on the impact of tokenization strategies. Using the Blackbird Language Matrices task on natural data, their study revealed that for Turkish, which has transparent morphological markers, both monolingual and multilingual models performed well with either atomic or subword tokenization. However, for Hebrew, which features non-concatenative morphology, monolingual and multilingual models diverged significantly. A multilingual model employing character-level tokenization failed to capture Hebrew's morphology, whereas a monolingual model with morpheme-aware segmentation achieved strong performance. The researchers also noted that performance across all models improved when evaluated on more synthetic datasets.

Key takeaway

For research scientists developing or fine-tuning language models for morphologically rich languages, especially those with non-concatenative structures like Hebrew, you should prioritize morpheme-aware tokenization. Relying on character-level or generic subword tokenization for such languages can severely hinder a model's ability to capture essential linguistic nuances, leading to suboptimal performance in tasks involving complex verb paradigms.

Key insights

Tokenization strategy critically impacts Transformer models' ability to represent complex morphology, especially in non-concatenative languages.

Principles

Transparent morphology benefits from diverse tokenization.
Non-concatenative morphology requires morpheme-aware tokenization.

Method

Evaluated Transformer models on Turkish and Hebrew verb paradigms using the Blackbird Language Matrices task, comparing atomic, subword, character-level, and morpheme-aware tokenization.

In practice

Prioritize morpheme-aware tokenization for Hebrew.
Consider synthetic data for model performance boosts.

Topics

Transformer Models
Tokenization Strategies
Morphological Analysis
Turkish Language Processing
Hebrew Language Processing

Best for: Research Scientist, AI Researcher, NLP Engineer, AI Scientist

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.