Twenty Years of HAREM: A Reproducible Audit and Reassessment of Portuguese Named Entity Recognition

2026-04-12 · Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Expert, quick

Summary

A systematic audit of the HAREM corpus, a foundational benchmark for Portuguese Named Entity Recognition (NER) for two decades, found 153 sentences that appear in both halves of its fixed train/test split. The researchers re-evaluated 13 NER models, ranging from CRFs to Transformers, on both the original corpus and a newly decontaminated version. For the majority of models, decontamination produced a statistically significant (p < 0.05) improvement. The gains were most pronounced in macro-averaged F1 (macro-F1), which rose by up to 4 points, suggesting that the contamination primarily hindered generalization on rare entity types. The audit also found evidence of overfitting in some models that benefited from the leakage. It concludes that even minor contamination can distort performance metrics and mask true model generalization. A decontaminated benchmark has been released to support more reliable future evaluations.

Key takeaway

Research scientists developing or evaluating Portuguese NER models should switch to the newly released decontaminated HAREM benchmark. Results on the original corpus are affected by previously undetected data leakage, which can misrepresent both model performance and generalization, particularly for rare entity types. Basing future evaluations on the decontaminated version gives a more reliable measure of model progress.

Key insights

Even minor train/test contamination in a benchmark can distort performance metrics and mask a model's true generalization capabilities.

Principles

Benchmark integrity is crucial for valid model evaluation.
Data leakage can lead to inflated performance metrics.

Method

The audit involved identifying overlapping sentences in the HAREM corpus's train/test split, decontaminating the corpus, and then re-evaluating 13 NER models on both versions to assess performance differences.
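
The overlap check described above can be sketched roughly as follows. This is a minimal illustration, not the audit's actual procedure: it assumes that "overlapping" means a sentence is identical in train and test after lowercasing and whitespace normalization, and it assumes the leaked sentences are dropped from the training side. The summary states neither the matching criterion nor which side was cleaned. The function names and sample sentences are hypothetical.

```python
import re
from typing import Iterable, List, Set, Tuple


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not hide a duplicate."""
    return re.sub(r"\s+", " ", sentence.strip()).lower()


def find_overlap(train: Iterable[str], test: Iterable[str]) -> Set[str]:
    """Return the normalized sentences that occur in both splits."""
    train_keys = {normalize(s) for s in train}
    test_keys = {normalize(s) for s in test}
    return train_keys & test_keys


def decontaminate(train: List[str], test: List[str]) -> Tuple[List[str], List[str]]:
    """Drop every training sentence that also occurs in test (an assumed policy).

    The test split is returned unchanged.
    """
    leaked = find_overlap(train, test)
    clean_train = [s for s in train if normalize(s) not in leaked]
    return clean_train, test


# Toy data for illustration only; these are not HAREM sentences.
train = ["O Porto fica em Portugal.", "Maria vive em Lisboa."]
test = ["maria vive  em lisboa.", "O Brasil venceu o jogo."]

leaked = find_overlap(train, test)
clean_train, clean_test = decontaminate(train, test)
print(len(leaked), clean_train)
```

Exact matching after normalization only catches verbatim duplicates. Near-duplicates would need a fuzzier criterion, such as token n-gram overlap, which is outside this sketch.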

In practice

Audit existing datasets for train/test contamination.
Prioritize macro-F1 when assessing generalization on rare entity types.
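
To see why the macro average is more sensitive to rare entity types, the sketch below computes micro- and macro-averaged F1 from per-type counts of true positives, false positives, and false negatives. The type names and counts are invented for illustration and are not taken from the audit. The helper names are hypothetical.

```python
from typing import Dict, Tuple

# Entity type -> (true positives, false positives, false negatives)
Counts = Dict[str, Tuple[int, int, int]]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(counts: Counts) -> float:
    """Pool counts over all types first, so frequent types dominate."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


def macro_f1(counts: Counts) -> float:
    """Average the per-type F1 scores, so every type weighs the same."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


# Made-up counts: one frequent, well-predicted type and one rare, poorly predicted type.
counts: Counts = {
    "COMMON_TYPE": (90, 10, 10),  # per-type F1 = 0.90
    "RARE_TYPE": (1, 3, 3),       # per-type F1 = 0.25
}

print(f"micro-F1 = {micro_f1(counts):.3f}")  # 0.875
print(f"macro-F1 = {macro_f1(counts):.3f}")  # 0.575
```

In this toy case the rare type barely moves the micro average but pulls the macro average down sharply, which is why changes in rare-type performance show up mainly in macro-F1.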

Topics

Named Entity Recognition
HAREM Corpus
Data Contamination
Model Generalization
Portuguese Language Processing

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.