Twenty Years of HAREM: A Reproducible Audit and Reassessment of Portuguese Named Entity Recognition
Summary
A systematic audit of the HAREM corpus, the foundational benchmark for Portuguese Named Entity Recognition (NER) for two decades, found 153 sentences that appear in both halves of its fixed train/test split. The researchers re-evaluated 13 NER models, ranging from CRFs to Transformers, on both the original corpus and a newly decontaminated version. Statistical analysis showed that decontamination had a significant (p < 0.05) positive effect on the majority of models. Gains were most pronounced in macro-averaged F1 (F1_macro), which rose by up to 4 points, suggesting that the contamination primarily hindered generalization on rare entity types. The audit also found evidence of overfitting in some models that had benefited from the leakage. The authors conclude that even minor contamination can distort performance metrics and mask true model generalization, and they have released a decontaminated benchmark to support more reliable future evaluations.
Key takeaway
Research scientists developing or evaluating Portuguese NER models should move to the newly released decontaminated HAREM benchmark. Because of previously undetected data leakage, results on the original corpus can misstate model performance and give a misleading picture of generalization, particularly for rare entity types. Basing future evaluations on the decontaminated split gives a more reliable measure of model progress.
Key insights
Data contamination in benchmarks significantly distorts model performance and masks true generalization capabilities.
Principles
- Benchmark integrity is crucial for valid model evaluation.
- Data leakage can lead to inflated performance metrics.
Method
The audit involved identifying overlapping sentences in the HAREM corpus's train/test split, decontaminating the corpus, and then re-evaluating 13 NER models on both versions to assess performance differences.
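The overlap check described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the summary does not say how sentences were matched, so the sketch assumes exact matches after Unicode, case, and whitespace normalization, and it assumes that decontamination means dropping the shared sentences from the training side. The function names `normalize`, `find_overlap`, and `decontaminate` are hypothetical.

```python
import re
import unicodedata


def normalize(sentence: str) -> str:
    """Apply NFKC normalization, lowercase, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", sentence).lower()
    return re.sub(r"\s+", " ", text).strip()


def find_overlap(train: list[str], test: list[str]) -> list[str]:
    """Return the test sentences whose normalized form also occurs in train."""
    train_keys = {normalize(s) for s in train}
    return [s for s in test if normalize(s) in train_keys]


def decontaminate(train: list[str], test: list[str]) -> list[str]:
    """Return train without sentences that also occur in test.

    Assumption: leaked sentences are removed from the training side,
    which leaves the test set unchanged.
    """
    test_keys = {normalize(s) for s in test}
    return [s for s in train if normalize(s) not in test_keys]


# Toy data, invented for illustration only.
train_sentences = ["O Porto venceu.", "Lisboa é a capital."]
test_sentences = ["o  porto venceu.", "Maria mora em Braga."]

leaked = find_overlap(train_sentences, test_sentences)
clean_train = decontaminate(train_sentences, test_sentences)
```

Exact matching after normalization is the most conservative choice. A near-duplicate criterion (for example, token-level similarity above a threshold) would flag more sentences, at the cost of some false positives.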
In practice
- Audit existing datasets for train/test contamination.
- Prioritize macro-averaged F1 (F1_macro) when assessing generalization to rare entity types.
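To see why macro-averaged F1 is the more sensitive metric for rare entity types, the sketch below computes micro- and macro-averaged F1 from per-type counts. The counts are invented for illustration and are not results from the paper, and the helper names are hypothetical. It also assumes entity-level true positives, false positives, and false negatives have already been tallied per type.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 expressed directly in counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(counts: dict[str, tuple[int, int, int]]) -> float:
    """Pool counts over all entity types, then compute a single F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1_from_counts(tp, fp, fn)


def macro_f1(counts: dict[str, tuple[int, int, int]]) -> float:
    """Compute F1 per entity type, then take the unweighted mean."""
    scores = [f1_from_counts(*c) for c in counts.values()]
    return sum(scores) / len(scores)


# (TP, FP, FN) per entity type. Made-up numbers: one frequent type
# that is recognized well, one rare type that is recognized poorly.
example_counts = {
    "PESSOA": (90, 10, 10),
    "OBRA": (1, 1, 3),
}

micro = micro_f1(example_counts)
macro = macro_f1(example_counts)
```

In this toy case the frequent type dominates the pooled counts, so micro-F1 stays high while macro-F1 drops sharply. Each type contributes equally to the macro average, so a change that affects only rare types shows up there first.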
Topics
- Named Entity Recognition
- HAREM Corpus
- Data Contamination
- Model Generalization
- Portuguese Language Processing
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.