Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task

2026-04-16 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A new study evaluates modern multilingual sentence embedding models for hate speech detection in Lithuanian, Russian, and English. The researchers introduce LtHate, a new Lithuanian hate speech corpus sourced from news portals and social networks. Six multilingual encoders (potion, gemma, bge, snow, jina, e5) were benchmarked on the LtHate, RuToxic, and EnSuperset datasets using a unified Python pipeline. For each encoder, the study trained both one-class HBOS (Histogram-Based Outlier Score) anomaly detectors and two-class CatBoost classifiers, with and without Principal Component Analysis (PCA) compression to 64-dimensional feature vectors. Supervised two-class models significantly outperformed one-class anomaly detection. The best configurations achieved 80.96% accuracy and 0.887 AUC ROC in Lithuanian (jina), 92.19% accuracy and 0.978 AUC ROC in Russian (e5), and 77.21% accuracy and 0.859 AUC ROC in English (e5 with PCA). PCA compression largely preserved discriminative power in supervised settings.

Key takeaway

For AI engineers building multilingual content moderation systems: prefer supervised classification over anomaly detection. Pairing modern multilingual sentence embeddings such as jina or e5 with gradient-boosted decision trees, specifically CatBoost, gave the strongest results in this study. Also consider compressing the embedding vectors to 64 dimensions with PCA: in supervised settings it largely preserved discriminative power, so it can reduce resource usage with little loss of accuracy.

Key insights

Supervised classifiers built on multilingual sentence embeddings and gradient-boosted trees outperformed one-class anomaly detection for hate speech detection.

Principles

Supervised two-class models outperform one-class anomaly detection.
PCA can compress embeddings to 64 dimensions with little loss in supervised settings.
Multilingual embeddings extend to lower-resource languages such as Lithuanian.
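
To make the one-class side of the comparison concrete, here is a minimal sketch of a histogram-based outlier score (HBOS) in NumPy. It is not the study's implementation. The vectors are synthetic stand-ins for sentence embeddings, and fitting on the neutral class only, the bin count, and the clipping of out-of-range values are all assumptions made for illustration.

```python
import numpy as np

def fit_hbos(X_normal, n_bins=10):
    """Build one equal-width histogram per feature from normal-class rows only."""
    model = []
    for j in range(X_normal.shape[1]):
        counts, edges = np.histogram(X_normal[:, j], bins=n_bins)
        # Relative bin heights, floored so empty bins do not produce log(0).
        heights = np.maximum(counts / counts.max(), 1e-6)
        model.append((edges, heights))
    return model

def hbos_score(model, X):
    """Sum of log(1 / bin height) over features; higher means more anomalous."""
    scores = np.zeros(X.shape[0])
    for j, (edges, heights) in enumerate(model):
        # Map each value to its bin; out-of-range values are clipped to the edge bins.
        idx = np.searchsorted(edges, X[:, j], side="right") - 1
        idx = np.clip(idx, 0, len(heights) - 1)
        scores += np.log(1.0 / heights[idx])
    return scores

rng = np.random.default_rng(0)
# Hypothetical data: 8-dimensional vectors standing in for embeddings.
normal_train = rng.normal(0.0, 1.0, size=(500, 8))
normal_test = rng.normal(0.0, 1.0, size=(100, 8))
shifted_test = rng.normal(3.0, 1.0, size=(100, 8))  # the class never seen in training

model = fit_hbos(normal_train)
normal_scores = hbos_score(model, normal_test)
shifted_scores = hbos_score(model, shifted_test)
```

Because the detector never sees labels for the second class, it can only flag text that looks unusual feature by feature, which is one plausible reason a supervised classifier has the advantage on this task.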

Method

Benchmarked six multilingual encoders on three datasets (LtHate, RuToxic, EnSuperset) using CatBoost classifiers and HBOS anomaly detectors, with and without PCA for dimensionality reduction.
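
The benchmark design described above can be sketched as a configuration grid. This assumes every combination of dataset, encoder, model type, and PCA setting was evaluated, which the summary implies but does not state outright. The dictionary keys are illustrative names, and the encoder labels are the short names used in the summary, not full model identifiers.

```python
from itertools import product

ENCODERS = ["potion", "gemma", "bge", "snow", "jina", "e5"]
DATASETS = ["LtHate", "RuToxic", "EnSuperset"]
MODELS = ["HBOS", "CatBoost"]   # one-class detector vs. two-class classifier
PCA_OPTIONS = [False, True]     # full-size vectors vs. 64-dimensional PCA

# One entry per experimental configuration in the (assumed) full grid.
configs = [
    {"dataset": d, "encoder": e, "model": m, "pca": p}
    for d, e, m, p in product(DATASETS, ENCODERS, MODELS, PCA_OPTIONS)
]

print(len(configs))  # 3 * 6 * 2 * 2 = 72
```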

In practice

Use CatBoost for hate speech classification.
Consider jina or e5 for multilingual embeddings.
Apply PCA to reduce embeddings to 64 dimensions when memory or compute is constrained.

Topics

Multilingual Embeddings
Hate Speech Detection
CatBoost Classification
LtHate Corpus
PCA Compression

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.