Anatomy of Data Repositories for the Analysis and Detection of Toxicity in Portuguese

2026-04-12 · Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

This study analyzes six Brazilian Portuguese datasets built for hate speech and toxicity detection to understand how their lexical composition and domain characteristics influence cross-domain generalization. The authors combined HurtLex-based lexical profiling with cross-dataset evaluation, using BERTimbau embeddings and an XGBoost classifier in a feature-based transfer-learning setup. The analysis showed that although the datasets share a similar macro-level focus, they differ significantly in term usage and labeling across platforms and topics. The findings indicate that lexical breadth and annotation practices are strong predictors of transferability: datasets with diverse, heterogeneous toxic vocabularies perform better across domains, while those with narrow, profanity-focused labeling show substantial generalization gaps even when lexical overlap is high. Collection and labeling strategies therefore play a critical role in curating and evaluating Portuguese hate speech datasets.

Key takeaway

Research scientists developing hate speech detection models for Portuguese should prioritize datasets with broad, heterogeneous toxic vocabularies. Datasets with narrow, profanity-centered labeling are likely to generalize poorly across domains, even when they appear lexically similar to the target data. Diverse annotation practices should improve model transferability and robustness across online platforms and topics.

Key insights

Lexical diversity and annotation practices significantly impact cross-domain generalization in hate speech detection datasets.

Principles

Broader toxic vocabulary improves cross-domain performance.
Narrow, profanity-centered labeling creates generalization gaps.
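
One way to make "vocabulary breadth" and "lexical overlap" concrete is to count which lexicon terms each dataset actually contains and compare those sets pairwise. The following is a minimal sketch under stated assumptions, not the study's procedure: TOY_LEXICON is a hypothetical stand-in for HurtLex (real entries and category codes are not reproduced), the dataset names and texts are invented, and Jaccard overlap is an assumed choice of similarity measure.

```python
import re
from itertools import combinations

# Hypothetical mini-lexicon (term -> category). This is NOT HurtLex data;
# it only mimics the shape of a categorized lexicon of hurtful words.
TOY_LEXICON = {
    "idiota": "insult",
    "burro": "insult",
    "lixo": "insult",
    "vagabundo": "moral",
    "porco": "animal",
}


def tokenize(text):
    """Lowercase the text and extract Unicode word tokens."""
    return re.findall(r"\w+", text.lower())


def lexical_profile(texts, lexicon):
    """Return the set of lexicon terms that occur anywhere in the texts."""
    hits = set()
    for text in texts:
        hits.update(tok for tok in tokenize(text) if tok in lexicon)
    return hits


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def profile_datasets(datasets, lexicon):
    """Compute per-dataset breadth and pairwise overlap of matched terms."""
    profiles = {name: lexical_profile(texts, lexicon)
                for name, texts in datasets.items()}
    breadth = {name: len(terms) for name, terms in profiles.items()}
    overlap = {(a, b): jaccard(profiles[a], profiles[b])
               for a, b in combinations(sorted(profiles), 2)}
    return breadth, overlap


# Invented example data: two tiny "datasets" from different platforms.
datasets = {
    "forum": ["Você é um idiota e burro", "que dia bonito"],
    "news_comments": ["político vagabundo, lixo", "seu idiota"],
}
breadth, overlap = profile_datasets(datasets, TOY_LEXICON)
print(breadth)
print(overlap)
```

Breadth here is simply the number of distinct lexicon terms matched. Note that the summary reports generalization gaps even with high lexical overlap, so an overlap score of this kind would be a descriptive signal rather than a guarantee of transferability.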

Method

The study used HurtLex-based lexical profiling and cross-dataset evaluation with BERTimbau embeddings and an XGBoost classifier in a feature-based transfer-learning setup.
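
The cross-dataset evaluation loop can be sketched as a train-on-one, test-on-all matrix. This is an illustrative sketch only: the keyword classifier below is a deliberately simple stand-in so the example runs with the standard library, and in the setup described above its place would be taken by BERTimbau embeddings fed to an XGBoost classifier. The function names, the toy data, and the use of macro-F1 as the metric are all assumptions.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cross_dataset_matrix(datasets, fit, predict):
    """Train on each source dataset and score on every target dataset.

    For brevity the diagonal is scored on the training data itself;
    a real experiment would use a held-out split of the source.
    """
    results = {}
    for src, (x_train, y_train) in datasets.items():
        model = fit(x_train, y_train)
        for tgt, (x_test, y_test) in datasets.items():
            results[(src, tgt)] = macro_f1(y_test, predict(model, x_test))
    return results


# Stand-in model (assumption): remember tokens seen only in toxic examples.
def fit_keyword(texts, labels):
    toxic, clean = set(), set()
    for text, label in zip(texts, labels):
        (toxic if label == 1 else clean).update(text.lower().split())
    return toxic - clean


def predict_keyword(model, texts):
    """Flag a text as toxic (1) if it contains any remembered token."""
    return [int(any(tok in model for tok in text.lower().split()))
            for text in texts]


# Invented toy data: (texts, labels) per dataset, 1 = toxic, 0 = not toxic.
toy = {
    "A": (["seu idiota", "bom dia", "que burro", "obrigado amigo"],
          [1, 0, 1, 0]),
    "B": (["lixo de gente", "boa noite", "seu idiota", "gente boa"],
          [1, 0, 1, 0]),
}
matrix = cross_dataset_matrix(toy, fit_keyword, predict_keyword)
for pair, score in sorted(matrix.items()):
    print(pair, round(score, 3))
```

Passing fit and predict as callables keeps the evaluation loop independent of the model, so an embedding-plus-classifier pipeline can be swapped in without changing how the matrix is built. The gap between diagonal and off-diagonal scores is one simple way to read a generalization gap.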

In practice

Prioritize diverse lexical content in dataset curation.
Avoid overly narrow, profanity-focused annotation.

Topics

Hate Speech Detection
Toxicity Analysis
Brazilian Portuguese
Dataset Analysis
Cross-Domain Generalization

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.