NorBERTo: A ModernBERT Model Trained for Portuguese with 331 Billion Tokens Corpus
Summary
NorBERTo is a new encoder-only model for Portuguese Natural Language Processing (NLP), built on the ModernBERT architecture with long-context support and efficient attention. It was trained on Aurora-PT, a Brazilian Portuguese corpus of 331 billion GPT-2 tokens that is currently the largest openly available monolingual Portuguese corpus. The authors benchmarked NorBERTo against encoders such as BERTimbau and Albertina PT-BR on semantic similarity, textual entailment, and classification tasks drawn from ASSIN 2 and PLUE. On PLUE, NorBERTo-large reached 0.9191 F1 on MRPC and 0.7689 accuracy on RTE; on ASSIN 2, it reached the highest entailment F1 (0.904) among the evaluated encoders. Albertina-900M and BERTimbau-large still held advantages on some tasks. The model is designed for realistic deployment: it is easy to fine-tune and efficient to serve.
Key takeaway
For AI Engineers and Research Scientists developing Portuguese NLP systems, NorBERTo offers a modern, mid-sized encoder that is efficient to serve and straightforward to fine-tune. Its strong results on PLUE and ASSIN 2, combined with the large Aurora-PT training corpus, make it a compelling choice for retrieval-augmented generation (RAG) and other downstream applications. Consider it as the encoder in your Portuguese NLP pipelines.
Key insights
NorBERTo is a new ModernBERT encoder for Portuguese, trained on the 331-billion-token Aurora-PT corpus, with strong results on PLUE and ASSIN 2.
Principles
- High-quality corpora are essential for NLP advancement.
- ModernBERT architecture supports long-context and efficient attention.
Method
NorBERTo was trained on the 331 billion GPT-2 token Aurora-PT corpus, then benchmarked against baselines on semantic similarity, textual entailment, and classification tasks using ASSIN 2 and PLUE datasets.
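The results above are reported as F1 and accuracy. As a reminder of what those numbers measure, here is a minimal sketch for a binary task such as paraphrase detection or entailment. The function names and toy labels are illustrative assumptions, and this summary does not state which F1 averaging the paper used on each dataset, so binary F1 on the positive class is assumed here.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive class.

    Assumption: binary F1; a paper may instead report macro or weighted F1.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy gold labels and predictions (hypothetical, not taken from the paper).
gold = [1, 0, 1, 1, 0, 1]
pred = [1, 0, 0, 1, 1, 1]
print(f"accuracy={accuracy(gold, pred):.4f}  f1={binary_f1(gold, pred):.4f}")
```

Accuracy counts every correct decision equally, while F1 ignores true negatives, which is why the two metrics can rank models differently on imbalanced tasks.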
In practice
- Use NorBERTo as a backbone for RAG systems.
- Deploy NorBERTo for efficient Portuguese NLP tasks.
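To make the RAG suggestion concrete, the retrieval step can be sketched as follows: pool an encoder's token vectors into one vector per text, then rank passages by cosine similarity to the query. This is a sketch under stated assumptions, not the authors' pipeline. Mean pooling is a common choice but is not confirmed by this summary, the helper names are hypothetical, and the tiny vectors stand in for real encoder outputs.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> Vector:
    """Average the token vectors whose mask is 1, skipping padding positions."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(query_vec: Sequence[float],
                  passage_vecs: Sequence[Sequence[float]],
                  top_k: int = 3) -> List[int]:
    """Return indices of the top_k passages, most similar to the query first."""
    order = sorted(range(len(passage_vecs)),
                   key=lambda i: cosine(query_vec, passage_vecs[i]),
                   reverse=True)
    return order[:top_k]


# Toy stand-ins for encoder outputs: 3 token positions, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
query = mean_pool(tokens, mask)

# Hypothetical pre-computed passage vectors.
passages = [[0.0, 1.0], [2.0, 3.1], [-1.0, 0.0]]
print(rank_passages(query, passages, top_k=2))
```

In a real system the passage vectors would be computed once and stored in a vector index, so only the query needs to be encoded at request time.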
Topics
- NorBERTo
- ModernBERT Architecture
- Portuguese NLP
- Aurora-PT Corpus
- Textual Entailment
Best for: AI Engineer, Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.