NorBERTo: A ModernBERT Model Trained for Portuguese with 331 Billion Tokens Corpus

Source: Paper Index on ACL Anthology · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, medium

Summary

NorBERTo is an encoder-only model for Portuguese natural language processing (NLP), built on the ModernBERT architecture with long-context support and efficient attention. It was trained on Aurora-PT, a Brazilian Portuguese corpus of 331 billion GPT-2 tokens that is currently the largest openly available monolingual Portuguese corpus. The model was benchmarked against encoders such as BERTimbau and Albertina PT-BR on semantic similarity, textual entailment, and classification tasks from the ASSIN 2 and PLUE datasets. On PLUE, NorBERTo-large reached 0.9191 F1 on MRPC and 0.7689 accuracy on RTE. On ASSIN 2, it achieved the highest entailment F1 (0.904) among the evaluated encoders, although Albertina-900M and BERTimbau-large retained some advantages. The model is designed for realistic deployment: it is straightforward to fine-tune and efficient to serve.

Key takeaway

For AI engineers and research scientists building Portuguese NLP systems, NorBERTo offers a modern, mid-sized encoder that is efficient to serve and straightforward to fine-tune. Its strong results on PLUE and ASSIN 2, combined with the large Aurora-PT training corpus, make it a compelling choice for retrieval-augmented generation and other downstream applications. Consider NorBERTo as the encoder component of your Portuguese NLP pipelines.

Key insights

NorBERTo is a ModernBERT-based encoder for Portuguese, trained on the 331-billion-token Aurora-PT corpus, with strong results on the PLUE and ASSIN 2 benchmarks.

Method

NorBERTo was trained on the Aurora-PT corpus (331 billion GPT-2 tokens) and then benchmarked against baseline encoders on semantic similarity, textual entailment, and classification tasks from the ASSIN 2 and PLUE datasets.
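The two headline metrics, accuracy (reported for RTE) and F1 (reported for MRPC and ASSIN 2 entailment), can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function names, the toy labels, and the choice of binary F1 on the positive class are assumptions made for illustration, since the summary does not say which F1 averaging was used.

```python
from typing import Sequence


def accuracy(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of examples whose predicted label equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def binary_f1(gold: Sequence[int], pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the `positive` class.

    Assumption: binary F1 on one class; the source does not state the averaging.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical, not taken from ASSIN 2 or PLUE): 1 = entailment.
gold_labels = [1, 1, 0, 0, 1, 0, 1, 0]
pred_labels = [1, 0, 0, 0, 1, 1, 1, 0]

print(f"accuracy = {accuracy(gold_labels, pred_labels):.4f}")
print(f"F1       = {binary_f1(gold_labels, pred_labels):.4f}")
```

In a real comparison, the predicted labels would come from each fine-tuned encoder on the same held-out split, so that scores such as 0.9191 F1 on MRPC are comparable across models.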

Best for: AI Engineer, Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.