UlyssesLegalNER-Br: from Legislative to Legal, a comprehensive corpus of Brazilian legal documents for Named Entity Recognition

Source: Paper Index on ACL Anthology · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, medium

Summary

UlyssesLegalNER-Br is a corpus of Brazilian legal documents for Named Entity Recognition (NER), built to address the scarcity of public datasets in the legal domain. It extends the earlier UlyssesNER-Br, which covered only legislative texts, to include bills, case law, and general laws, and it features the first NER corpus based exclusively on Brazilian laws. The corpus comprises 560 public documents annotated with a hybrid approach, with entities organized into 9 broad categories and 23 fine-grained types. Experiments with CRF, BiLSTM, and BERTimbau architectures assessed predictive performance, computational cost, and label-level results. BERTimbau achieved the highest micro F1 score, 96.18% on the unified corpus, with six categories and seven types exceeding 95% F1, establishing a strong baseline for Brazilian legal NER.

Key takeaway

For research scientists developing NLP solutions for the Brazilian legal sector, UlyssesLegalNER-Br provides a publicly available dataset that goes beyond the legislative-only coverage of UlyssesNER-Br. You should consider this corpus for training and evaluating Named Entity Recognition models, using BERTimbau (96.18% micro F1 on the unified corpus) as a baseline. Because it spans bills, case law, and general laws, it can support legal AI applications that were previously held back by data scarcity.

Key insights

UlyssesLegalNER-Br offers a comprehensive, publicly available corpus for Brazilian legal Named Entity Recognition.

Principles

Method

The corpus was built by annotating 560 public Brazilian legal documents with a hybrid approach. Entities are organized into 9 broad categories and 23 fine-grained types, and the corpus was evaluated with CRF, BiLSTM, and BERTimbau models.
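
The headline metric reported for these models is micro F1. As a rough illustration of how such a score is typically computed for NER, here is a minimal sketch that pools true positives, false positives, and false negatives over all entity types. It assumes BIO-tagged sequences and exact span matching; the summary does not state the paper's exact evaluation protocol, and the labels used here (PESSOA, LOCAL, DATA) are hypothetical placeholders rather than the corpus's actual label set.

```python
from typing import List, Set, Tuple

# (entity label, start index, end index exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence.

    An I- tag that does not continue the current span is treated
    leniently as the start of a new span.
    """
    spans: Set[Span] = set()
    label, start = None, 0
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == label
        if label is not None and not continues:
            spans.add((label, start, i))
            label = None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            label, start = tag[2:], i
    return spans


def micro_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Entity-level micro F1: counts are pooled across all labels."""
    tp = fp = fn = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = extract_spans(g_tags), extract_spans(p_tags)
        tp += len(g & p)   # spans with identical label and boundaries
        fp += len(p - g)   # predicted spans absent from the gold data
        fn += len(g - p)   # gold spans the model missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data, not taken from the corpus.
gold = [["B-PESSOA", "I-PESSOA", "O", "B-LOCAL"], ["O", "B-DATA", "I-DATA"]]
pred = [["B-PESSOA", "I-PESSOA", "O", "O"], ["O", "B-DATA", "O"]]
print(f"{micro_f1(gold, pred):.2f}")  # 0.40 (tp=1, fp=1, fn=2)
```

Because the counts are pooled before the ratio is taken, frequent entity types dominate a micro-averaged score, which is one reason label-level results are usually reported alongside it.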

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.