Modular Monolingual Adaptation using Pretrained Language Models
Summary
This work introduces a modular monolingual adaptation strategy for pretrained language models (PLMs) to improve performance in low-resource languages like Scottish Gaelic, Irish, and Quechua (8.5k training instances). Instead of full model finetuning, the proposed method replaces the tokenizer with a language-specific one, initializes new embeddings (using model-based, FastText, or random strategies), freezes these embedding layers, and trains only the non-embedding parameters using a masked language modeling objective. Experiments across BERT, mBERT, and mmBERT models demonstrate that this non-embedding training consistently outperforms or matches full finetuning on mask-filling tasks, while requiring significantly fewer trainable parameters (reducing them by approximately 25% for mBERT) and lowering training costs. For downstream NER and POS tagging tasks, the modular approach performs comparably to full finetuning, validating its effectiveness and efficiency. The custom tokenizer is crucial for performance gains, while embedding initialization choice has a minor impact.
Key takeaway
Machine learning engineers adapting pretrained language models to low-resource languages should adopt a modular approach: use a custom tokenizer and freeze the embedding layers. Training only the non-embedding parameters matches or outperforms full finetuning on mask filling, performs comparably on downstream NER and POS tagging, and is reported to outperform LoRA. It also significantly reduces trainable parameters and VRAM usage. Prioritize custom tokenization, which drives most of the performance gain; the choice of embedding initialization has only a minor impact. Overall, the method offers a more efficient path to language models for low-resource languages.
Key insights
Freezing embeddings during monolingual adaptation of PLMs for low-resource languages prevents overfitting and improves performance.
Principles
- Full model tuning is often unnecessary for low-resource adaptation.
- Custom language-specific tokenizers significantly boost performance.
- Freezing embeddings acts as an effective regularization strategy.
Method
Build a custom WordPiece tokenizer (30k vocabulary), initialize the new embeddings (model-based, FastText, or random), freeze the input and output embeddings, then train only the non-embedding parameters with a masked language modeling (MLM) objective.
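The freeze-then-train step can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `TinyMaskedLM`, `freeze_embeddings`, `mlm_step`, the toy sizes, and `MASK_ID` are hypothetical names and values, the model is a randomly initialized stand-in rather than a real PLM, input and output embeddings are assumed to be tied, and the custom tokenizer and embedding initialization steps are not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

VOCAB_SIZE = 100  # toy value; the source describes a 30k WordPiece vocabulary
HIDDEN = 32       # toy hidden size
MASK_ID = 1       # hypothetical id of the [MASK] token


class TinyMaskedLM(nn.Module):
    """Toy stand-in for a pretrained encoder (assumption: tied embeddings)."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.encoder = nn.TransformerEncoderLayer(
            d_model=HIDDEN, nhead=4, dim_feedforward=64,
            dropout=0.0, batch_first=True,
        )
        self.decoder = nn.Linear(HIDDEN, VOCAB_SIZE, bias=False)
        # Share one weight matrix between input and output embeddings.
        self.decoder.weight = self.embed.weight

    def forward(self, ids):
        return self.decoder(self.encoder(self.embed(ids)))


def freeze_embeddings(model):
    """Exclude the (tied) embedding matrix from gradient updates."""
    for p in model.embed.parameters():
        p.requires_grad = False


def mlm_step(model, optimizer, ids, mask_prob=0.15):
    """Run one masked language modeling update and return the loss."""
    mask = torch.rand(ids.shape) < mask_prob
    mask[0, 0] = True  # guarantee at least one masked position
    inputs = ids.masked_fill(mask, MASK_ID)
    labels = ids.masked_fill(~mask, -100)  # score masked positions only
    logits = model(inputs)
    loss = F.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), labels.reshape(-1), ignore_index=-100
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


model = TinyMaskedLM()
freeze_embeddings(model)

# Only the non-embedding parameters are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

ids = torch.randint(2, VOCAB_SIZE, (4, 16))  # fake batch of token ids
loss = mlm_step(model, optimizer, ids)

n_train = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in model.parameters())
print(f"loss={loss:.3f}, trainable {n_train} of {n_total} parameters")
```

Because the frozen matrix is never passed to the optimizer, it receives neither gradient updates nor weight decay. With a real PLM, the same pattern would apply to whatever modules hold the input and output embeddings in that model.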
In practice
- Implement custom tokenizers for low-resource language LMs.
- Freeze embedding layers to prevent overfitting in low-resource settings.
- Prioritize non-embedding parameter tuning over full finetuning.
Topics
- Low-Resource Languages
- Pretrained Language Models
- Modular Adaptation
- Tokenizer Customization
- Embedding Freezing
- Natural Language Understanding
- Parameter Efficiency
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.