JabuticaBERT: Modern Portuguese Encoders from Scratch with RTD and Long-Context Training

Source: Paper Index on ACL Anthology · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, medium

Summary

JabuticaBERT introduces monolingual encoder models for Brazilian Portuguese, aimed at performance limitations in natural language understanding (NLU) tasks. The researchers trained Portuguese-specific encoders from scratch using two modern architectures: DeBERTa with Replaced Token Detection (RTD) and ModernBERT with Masked Language Modeling (MLM). All models were pre-trained on the Jabuticaba corpus. The DeBERTa-Large model achieved F1 scores of 0.920 on ASSIN2 RTE and 0.915 on LeNER, matching the 900M-parameter Albertina model with substantially fewer parameters. The project also released custom tokenizers designed to lower token fertility (the average number of subword tokens produced per word) relative to existing multilingual baselines.
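Token fertility, mentioned above, can be made concrete with a small calculation. The following is a minimal sketch, assuming fertility is measured as subword tokens per whitespace-delimited word. The function name token_fertility and the toy_chunk_tokenizer stand-in are hypothetical illustrations, not the project's tokenizers or its evaluation code.

```python
from typing import Callable, Iterable, List


def token_fertility(texts: Iterable[str],
                    tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens per whitespace-delimited word.

    Assumption: each word is tokenized on its own, so the ratio is
    total subword tokens divided by total words.
    """
    total_words = 0
    total_tokens = 0
    for text in texts:
        words = text.split()
        total_words += len(words)
        total_tokens += sum(len(tokenize(word)) for word in words)
    if total_words == 0:
        raise ValueError("cannot compute fertility on an empty corpus")
    return total_tokens / total_words


def toy_chunk_tokenizer(word: str, size: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer: splits a word into fixed-size chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


if __name__ == "__main__":
    sample = ["a jabuticaba"]
    print(token_fertility(sample, toy_chunk_tokenizer))
```

To compare tokenizers in practice, the same function could be called once per tokenizer on the same held-out Portuguese text, passing each tokenizer's own tokenize callable. A lower value means shorter input sequences for the same text.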

Key takeaway

AI engineers building NLU solutions for Brazilian Portuguese should consider the JabuticaBERT DeBERTa-Large model. It matches the 900M-parameter Albertina model with significantly fewer parameters, which makes it a compelling option for efficient deployment. The custom monolingual tokenizers can reduce overhead further: lower token fertility means fewer tokens per input, and therefore shorter sequences to process.

Key insights

Monolingual encoders with careful architecture and tokenization can achieve competitive performance without massive scaling.

Method

Train Portuguese-specific encoder models (DeBERTa with RTD, ModernBERT with MLM) from scratch on the Jabuticaba corpus, using custom monolingual tokenizers.
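The two pre-training objectives named above differ in what the model is asked to predict. The following is a minimal sketch of how training examples are constructed for each, not the authors' pipeline. In RTD as introduced by ELECTRA, a small generator network proposes the replacement tokens. Here uniform random sampling from a toy vocabulary stands in for that generator, and the function names mlm_example and rtd_example are hypothetical.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def mlm_example(tokens: List[str], mask_prob: float,
                rng: random.Random) -> Tuple[List[str], List[Optional[str]]]:
    """MLM: hide some tokens; the model must recover the original token.

    Labels hold the original token at masked positions and None elsewhere,
    so the loss is computed only on the masked subset.
    """
    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


def rtd_example(tokens: List[str], vocab: List[str], replace_prob: float,
                rng: random.Random) -> Tuple[List[str], List[int]]:
    """RTD: swap some tokens; the model classifies every position.

    Assumption: random sampling replaces the generator network. A position
    is labeled 1 only if the sampled token differs from the original.
    """
    corrupted = [
        rng.choice(vocab) if rng.random() < replace_prob else tok
        for tok in tokens
    ]
    labels = [int(new != old) for new, old in zip(corrupted, tokens)]
    return corrupted, labels


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = ["o", "gato", "dorme", "no", "sofá"]
    print(mlm_example(sentence, 0.15, rng))
    print(rtd_example(sentence, ["cão", "corre", "casa"], 0.15, rng))
```

The practical difference visible in the sketch is that RTD yields a label for every position, while MLM yields a training signal only at the masked positions.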

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.