Anatomy of Data Repositories for the Analysis and Detection of Toxicity in Portuguese
Summary
This study analyzes six Brazilian Portuguese datasets for hate speech and toxicity detection to understand how their lexical composition and domain characteristics influence cross-domain generalization. The authors combine HurtLex-based lexical profiling with cross-dataset evaluation, using BERTimbau embeddings and an XGBoost classifier in a feature-based transfer-learning setup. The analysis shows that, although the datasets share a similar macro-level focus, they differ significantly in term usage and labeling across platforms and topics. Lexical breadth and annotation practices emerge as strong predictors of transferability: datasets with diverse, heterogeneous toxic vocabularies perform better across domains, while those with narrow, profanity-focused labeling show substantial generalization gaps even when lexical overlap is high. These results highlight the critical role of collection and labeling strategies in curating and evaluating Portuguese hate speech datasets.
Key takeaway
Research scientists developing hate speech detection models for Portuguese should prioritize datasets with broad and heterogeneous toxic vocabularies. Datasets with narrow, profanity-centered labeling are likely to generalize poorly across domains, even when they appear lexically similar to the target data. Favor annotation practices that capture varied forms of toxicity, which improves transferability and robustness across online platforms and topics.
Key insights
Lexical diversity and annotation practices significantly impact cross-domain generalization in hate speech detection datasets.
Principles
- Broader toxic vocabulary improves cross-domain performance.
- Narrow, profanity-centered labeling creates generalization gaps.
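The distinction between lexical breadth and lexical overlap can be made concrete with a small sketch. The code below is only an illustration, not the study's profiling pipeline: the lexicon is a hard-coded toy set standing in for HurtLex entries, the two datasets are invented, and the helper names (`toxic_vocabulary`, `lexical_breadth`, `jaccard`) are hypothetical.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\w+")  # Unicode-aware in Python 3, so accents are kept


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def toxic_vocabulary(texts, lexicon):
    """Count how often each lexicon term occurs in a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(tok for tok in tokenize(text) if tok in lexicon)
    return counts


def lexical_breadth(counts):
    """Number of distinct lexicon terms observed in a dataset."""
    return len(counts)


def jaccard(a, b):
    """Jaccard overlap between the sets of terms seen in two datasets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy stand-in for a lexicon such as HurtLex (assumption: mild placeholder terms).
lexicon = {"idiota", "burro", "lixo", "nojento"}

# Invented examples: dataset_a uses several terms, dataset_b repeats a single one.
dataset_a = ["Você é um idiota e burro", "Que lixo de comentário", "Bom dia a todos"]
dataset_b = ["idiota idiota idiota", "seu idiota"]

vocab_a = toxic_vocabulary(dataset_a, lexicon)
vocab_b = toxic_vocabulary(dataset_b, lexicon)

print("breadth A:", lexical_breadth(vocab_a))
print("breadth B:", lexical_breadth(vocab_b))
print("overlap A/B:", round(jaccard(vocab_a, vocab_b), 3))
```

In this toy case the second dataset has many lexicon hits but only one distinct term, which is the kind of narrow profile the principles above warn about. A real profile would load the actual lexicon and would likely need lemmatization, which this sketch omits.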
Method
The study used HurtLex-based lexical profiling and cross-dataset evaluation with BERTimbau embeddings and an XGBoost classifier in a feature-based transfer-learning setup.
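A cross-dataset evaluation of this kind can be sketched as a train-on-one, test-on-another loop over frozen features. The code below is a minimal sketch under loud assumptions: `toy_embed` (a hashed bag of words) stands in for BERTimbau embeddings and `NearestCentroid` stands in for the XGBoost classifier so that the example runs with the standard library alone. All names, the toy data, and the choice of macro-F1 as the metric are illustrative and not taken from the paper.

```python
import zlib


def toy_embed(text, dim=64):
    """Stand-in for a frozen encoder (assumption): hashed bag-of-words vector."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec


class NearestCentroid:
    """Stand-in for the downstream classifier (XGBoost in the study)."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids, key=lambda lab: dist(x, self.centroids[lab]))
            for x in X
        ]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cross_dataset_matrix(datasets, embed=toy_embed, make_clf=NearestCentroid):
    """Train on each source dataset and score on every other (target) dataset."""
    feats = {
        name: ([embed(t) for t in texts], labels)
        for name, (texts, labels) in datasets.items()
    }
    matrix = {}
    for src, (X_src, y_src) in feats.items():
        clf = make_clf().fit(X_src, y_src)  # features stay frozen; only clf is trained
        for tgt, (X_tgt, y_tgt) in feats.items():
            if tgt != src:  # in-domain scores would need a held-out split
                matrix[(src, tgt)] = macro_f1(y_tgt, clf.predict(X_tgt))
    return matrix


# Invented toy datasets: (texts, labels), with 1 = toxic and 0 = non-toxic.
datasets = {
    "a": (["seu idiota", "bom dia", "que lixo", "obrigado pela ajuda"], [1, 0, 1, 0]),
    "b": (["idiota demais", "boa tarde", "muito burro", "ótimo texto"], [1, 0, 1, 0]),
}
matrix = cross_dataset_matrix(datasets)
for (src, tgt), score in sorted(matrix.items()):
    print(f"train={src} test={tgt} macro-F1={score:.2f}")
```

The `embed` and `make_clf` parameters are the swap points: a real run would pass a function that returns BERTimbau sentence embeddings and a factory for an XGBoost classifier, probably with a thin adapter for array types. A generalization gap for a given source can then be read as its in-domain score on a held-out split minus its score on each target.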
In practice
- Prioritize diverse lexical content in dataset curation.
- Avoid overly narrow, profanity-focused annotation.
Topics
- Hate Speech Detection
- Toxicity Analysis
- Brazilian Portuguese
- Dataset Analysis
- Cross-Domain Generalization
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.