Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task
Summary
A new study evaluates the effectiveness of modern multilingual sentence embedding models for hate speech detection across Lithuanian, Russian, and English. Researchers introduced LtHate, a new Lithuanian hate speech corpus sourced from news portals and social networks. Six multilingual encoders (potion, gemma, bge, snow, jina, e5) were benchmarked on LtHate, RuToxic, and EnSuperset datasets using a unified Python pipeline. The study trained both one-class HBOS anomaly detectors and two-class CatBoost classifiers, with and without Principal Component Analysis (PCA) compression to 64-dimensional feature vectors. Supervised two-class models significantly outperformed one-class anomaly detection. The best configurations achieved 80.96% accuracy and 0.887 AUC ROC in Lithuanian (jina), 92.19% accuracy and 0.978 AUC ROC in Russian (e5), and 77.21% accuracy and 0.859 AUC ROC in English (e5 with PCA). PCA compression largely preserved discriminative power in supervised settings.
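The two metrics reported above can be computed as in the following minimal NumPy sketch. The function names, the 0.5 decision threshold, and the toy labels and scores are assumptions for illustration and do not come from the study.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def roc_auc(y_true, scores):
    """AUC ROC as the probability that a random positive example
    receives a higher score than a random negative one (ties count half)."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("both classes must be present")
    # Pairwise comparison is O(n_pos * n_neg): fine for a small illustration.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

# Toy data (hypothetical): 1 = hateful, 0 = neutral.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = (np.asarray(scores) >= 0.5).astype(int)  # assumed threshold

acc = accuracy(y_true, y_pred)   # 0.75
auc = roc_auc(y_true, scores)    # 0.75
```

Accuracy depends on the chosen threshold, while AUC ROC summarizes ranking quality across all thresholds, which is why the two numbers are reported side by side.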
Key takeaway
For AI engineers building multilingual content moderation systems, prefer supervised classifiers over one-class anomaly detection. Pairing modern multilingual sentence embeddings such as jina or e5 with a gradient-boosted decision tree classifier (CatBoost in this study) gave the strongest results. Compressing embeddings to 64 dimensions with PCA is also worth considering: in the supervised setting it largely preserved discriminative power while shrinking the feature vectors.
Key insights
Supervised classifiers that pair multilingual sentence embeddings with gradient-boosted trees were the strongest configurations for hate speech detection in all three languages tested.
Principles
- Supervised two-class models outperform one-class anomaly detection.
- PCA can compress embeddings to 64 dimensions with little loss in supervised settings.
- Multilingual embeddings support low-resource languages.
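The one-class side of the first principle can be made concrete with a small histogram-based outlier score (HBOS) written in NumPy. This is a minimal sketch and not the study's implementation: the class name `SimpleHBOS`, the bin count, the synthetic vectors, and the choice to fit on non-hateful examples only are all assumptions for illustration.

```python
import numpy as np

class SimpleHBOS:
    """Hypothetical minimal HBOS: one equal-width histogram per feature,
    score = sum over features of -log(normalized bin height)."""

    def __init__(self, n_bins=10, eps=1e-6):
        self.n_bins = n_bins
        self.eps = eps  # keeps log finite for empty or out-of-range bins

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.edges_, self.heights_ = [], []
        for j in range(X.shape[1]):
            counts, edges = np.histogram(X[:, j], bins=self.n_bins)
            self.edges_.append(edges)
            # Normalize so the tallest bin of each feature has height 1.
            self.heights_.append(counts / counts.max())
        return self

    def score(self, X):
        X = np.asarray(X, dtype=float)
        total = np.zeros(X.shape[0])
        for j, (edges, heights) in enumerate(zip(self.edges_, self.heights_)):
            col = X[:, j]
            idx = np.searchsorted(edges, col, side="right") - 1
            idx = np.clip(idx, 0, self.n_bins - 1)
            inside = (col >= edges[0]) & (col <= edges[-1])
            # Values outside the training range get height 0.
            h = np.where(inside, heights[idx], 0.0)
            total += -np.log(h + self.eps)
        return total  # higher means more anomalous

# Synthetic stand-ins for embedding vectors (assumed data, 8 dimensions).
rng = np.random.default_rng(0)
train_normal = rng.normal(0.0, 1.0, size=(500, 8))
test_inliers = rng.normal(0.0, 1.0, size=(100, 8))
test_outliers = rng.normal(4.0, 1.0, size=(100, 8))

detector = SimpleHBOS(n_bins=10).fit(train_normal)
inlier_scores = detector.score(test_inliers)
outlier_scores = detector.score(test_outliers)
```

The detector sees only one class during fitting and never learns what hateful text looks like, which is one plausible reason a two-class model that uses labels from both classes can separate them better.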
Method
Benchmarked six multilingual encoders on three datasets (LtHate, RuToxic, EnSuperset) using CatBoost classifiers and HBOS anomaly detectors, with and without PCA for dimensionality reduction.
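The PCA compression step in this method can be sketched with plain NumPy. The helper names `fit_pca` and `transform_pca`, the 384-dimensional random vectors standing in for sentence embeddings, and fitting the projection on the training split only are assumptions for illustration; the study's own pipeline may differ.

```python
import numpy as np

def fit_pca(X, n_components=64):
    """Fit PCA via SVD of the mean-centered data.
    Returns the mean, the top principal directions, and the
    fraction of variance those directions explain."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:n_components]
    explained = float((s[:n_components] ** 2).sum() / (s ** 2).sum())
    return mean, components, explained

def transform_pca(X, mean, components):
    """Project vectors onto the fitted principal directions."""
    return (np.asarray(X, dtype=float) - mean) @ components.T

# Random stand-ins for sentence embeddings (assumed shape: 384 dimensions).
rng = np.random.default_rng(0)
train_emb = rng.normal(size=(200, 384))
test_emb = rng.normal(size=(50, 384))

# Fit on training data only, then apply the same projection to test data.
mean, components, explained = fit_pca(train_emb, n_components=64)
train_64 = transform_pca(train_emb, mean, components)
test_64 = transform_pca(test_emb, mean, components)
```

The resulting 64-dimensional vectors would then be passed to the downstream classifier or anomaly detector in place of the full embeddings.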
In practice
- Use CatBoost for hate speech classification.
- Consider jina or e5 for multilingual embeddings.
- Apply PCA to reduce embedding dimensionality.
Topics
- Multilingual Embeddings
- Hate Speech Detection
- CatBoost Classification
- LtHate Corpus
- PCA Compression
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.