Global vs. Local Sentence Embeddings for Brazilian Portuguese: Revisiting Monolingual Models in the Age of Foundation Models
Summary
An empirical study presented at PROPOR 2026 by Peixoto et al. investigates the trade-off between large-scale, multilingual foundation models and specialized monolingual models for Brazilian Portuguese (PT-BR) sentence embeddings. The research evaluates various language model families using both linear probing and fine-tuning. Findings indicate that monolingual encoders demonstrate superior "adaptation plasticity" during fine-tuning, enhancing performance in classification and semantic similarity tasks, while global (multilingual) models show degradation. However, this adaptability comes with a drawback: monolingual models struggle with foreign terms, whereas modern multilingual tokenizers exhibit unexpected morphological competence. The study concludes that the optimal model selection is contingent on the specific task, balancing vocabulary coverage against adaptation flexibility.
Key takeaway
For AI Engineers developing natural language processing solutions for Brazilian Portuguese, selecting between monolingual and multilingual models requires a task-specific evaluation. You should prioritize monolingual encoders for tasks requiring high adaptation plasticity, such as classification or semantic similarity, especially when fine-tuning is an option. However, be mindful of their limitations with foreign terms and consider integrating multilingual tokenizers to address vocabulary coverage challenges.
Key insights
Optimal model choice for PT-BR sentence embeddings balances vocabulary coverage and adaptation flexibility.
Principles
- Monolingual encoders offer greater adaptation plasticity.
- Multilingual tokenizers show surprising morphological competence.
Method
The study empirically evaluates multiple language model families using linear probing and fine-tuning regimes across diverse tasks to compare generalization vs. specialization.
In practice
- Prioritize monolingual models for fine-tuning tasks.
- Consider multilingual tokenizers for foreign term handling.
Topics
- Sentence Embeddings
- Brazilian Portuguese
- Foundation Models
- Monolingual Models
- Multilingual Models
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.