Modular Monolingual Adaptation using Pretrained Language Models

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

This work introduces a modular monolingual adaptation strategy for pretrained language models (PLMs) to improve performance in low-resource languages like Scottish Gaelic, Irish, and Quechua (8.5k training instances). Instead of full model finetuning, the proposed method replaces the tokenizer with a language-specific one, initializes new embeddings (using model-based, FastText, or random strategies), freezes these embedding layers, and trains only the non-embedding parameters using a masked language modeling objective. Experiments across BERT, mBERT, and mmBERT models demonstrate that this non-embedding training consistently outperforms or matches full finetuning on mask-filling tasks, while requiring significantly fewer trainable parameters (reducing them by approximately 25% for mBERT) and lowering training costs. For downstream NER and POS tagging tasks, the modular approach performs comparably to full finetuning, validating its effectiveness and efficiency. The custom tokenizer is crucial for performance gains, while embedding initialization choice has a minor impact.

Key takeaway

For Machine Learning Engineers adapting pretrained language models to low-resource languages, you should adopt a modular approach by using a custom tokenizer and freezing embedding layers. This strategy, training only non-embedding parameters, consistently outperforms full finetuning and LoRA on NLU tasks. It also significantly reduces trainable parameters and VRAM usage. Prioritize custom tokenization for substantial performance gains, as embedding initialization choice has a minor impact. This method offers a more efficient and effective path for low-resource language model development.

Key insights

Freezing embeddings during monolingual adaptation of PLMs for low-resource languages prevents overfitting and improves performance.

Principles

Method

Build a custom WordPiece tokenizer (30k vocab), initialize embeddings (model, FastText, or random), freeze input/output embeddings, then train non-embedding parameters with MLM objective.

In practice

Topics

Code references

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.