JabuticaBERT: Modern Portuguese Encoders from Scratch with RTD and Long-Context Training
Summary
JabuticaBERT introduces monolingual encoder models for Brazilian Portuguese, targeting a performance gap in Portuguese natural language understanding (NLU) tasks. The authors trained Portuguese-specific encoders from scratch using two modern architectures: DeBERTa, trained with Replaced Token Detection (RTD), and ModernBERT, trained with Masked Language Modeling (MLM). All models were pre-trained on the Jabuticaba corpus. The DeBERTa-Large model achieved F1 scores of 0.920 on ASSIN2 RTE and 0.915 on LeNER, matching the 900M-parameter Albertina model with substantially fewer parameters. The project also released custom tokenizers designed to lower token fertility (the average number of tokens produced per word) relative to existing multilingual baselines.
Key takeaway
AI engineers building NLU solutions for Brazilian Portuguese should consider the JabuticaBERT DeBERTa-Large model. It matches the 900M-parameter Albertina model on ASSIN2 RTE and LeNER with significantly fewer parameters, which makes it a compelling option for efficient deployment. The custom monolingual tokenizers are designed to produce fewer tokens per word, so the same text yields shorter input sequences and less computation.
Key insights
Monolingual encoders with careful architecture and tokenization can achieve competitive performance without massive scaling.
Principles
- Monolingual tokenization reduces token fertility rates.
- RTD training can yield strong performance with fewer parameters.
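Token fertility is commonly measured as the average number of tokens a tokenizer produces per whitespace-separated word. The sketch below illustrates that calculation only; `word_tokenizer` and `chunk_tokenizer` are hypothetical toy stand-ins, not the JabuticaBERT tokenizers or any multilingual baseline, and the sample sentence is made up for illustration.

```python
from typing import Callable, Iterable, List


def fertility(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def word_tokenizer(text: str) -> List[str]:
    """Toy tokenizer (assumption): one token per word, the best case."""
    return text.split()


def chunk_tokenizer(text: str, size: int = 3) -> List[str]:
    """Toy tokenizer (assumption): splits each word into fixed-size pieces,
    mimicking a vocabulary that fragments Portuguese words heavily."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + size] for i in range(0, len(word), size))
    return pieces


sample = ["a jabuticaba amadureceu no quintal"]  # illustrative sentence
word_fertility = fertility(word_tokenizer, sample)
chunk_fertility = fertility(chunk_tokenizer, sample)
```

To compare real tokenizers, one could pass each tokenizer's own tokenize function as `tokenize` and run both over the same held-out Portuguese text; the lower value indicates shorter sequences for the same input.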
Method
Train Portuguese-specific encoder models (DeBERTa with RTD, ModernBERT with MLM) from scratch on the Jabuticaba corpus, using custom monolingual tokenizers.
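To make the RTD objective concrete, the sketch below builds one training example in the ELECTRA style: some tokens are swapped out, and the encoder is trained to label each position as original (0) or replaced (1). This is a simplified illustration under stated assumptions, not the authors' pipeline. In practice a small generator network proposes the replacements; here uniform sampling from a toy vocabulary stands in for it, and the function name `make_rtd_example` and the 15% replacement rate are assumptions, not values from the source.

```python
import random
from typing import List, Optional, Sequence, Tuple


def make_rtd_example(
    tokens: Sequence[str],
    vocab: Sequence[str],
    replace_prob: float = 0.15,  # assumed rate, not taken from the source
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[int]]:
    """Return (corrupted_tokens, labels) for a replaced-token-detection task.

    labels[i] is 1 when the token at position i differs from the original,
    and 0 otherwise (including when the sampled token equals the original).
    """
    rng = rng or random.Random()
    corrupted: List[str] = []
    labels: List[int] = []
    for tok in tokens:
        if rng.random() < replace_prob:
            # Stand-in for sampling from a generator model (assumption).
            new_tok = rng.choice(vocab)
        else:
            new_tok = tok
        corrupted.append(new_tok)
        labels.append(int(new_tok != tok))
    return corrupted, labels


tokens = ["o", "tribunal", "julgou", "o", "recurso"]  # illustrative input
toy_vocab = ["casa", "verde", "correr"]               # illustrative vocabulary
corrupted, labels = make_rtd_example(tokens, toy_vocab, rng=random.Random(0))
```

Unlike MLM, which computes a loss only at masked positions, this objective gives the encoder a binary target at every position in the sequence.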
In practice
- Use the JabuticaBERT DeBERTa-Large model for Brazilian Portuguese NLU.
- Use the released custom tokenizers to lower the number of tokens per word.
Topics
- JabuticaBERT
- Portuguese Encoders
- DeBERTa Architecture
- Replaced Token Detection
- Jabuticaba Corpus
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.