CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges
Summary
CATCH-ME (Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges) is introduced as the first large-scale, expert-curated, multilingual resource addressing the intersection of online hate speech and misinformation. Developed with 23 domain experts over 18 months, it comprises 2,015 dialogues and 12,298 turns, of which 6,149 are counterspeech turns grounded in external knowledge from fact-checking articles and NGO reports. The dataset covers five languages (English, Italian, Maltese, Polish, and Spanish) and targets hostility towards seven marginalized groups, including Muslims, Jewish people, and LGBTQIA+ individuals. Its document- and chunk-level annotations make it directly applicable to Retrieval-Augmented Generation (RAG) systems, supporting the training and evaluation of more persuasive, factually grounded counterspeech models. The paper also establishes initial benchmarks for retrieval and generation tasks on the corpus.
Key takeaway
For NLP engineers building systems to combat online hate and misinformation, CATCH-ME is a multilingual, multi-turn, knowledge-grounded dataset for training and evaluating Retrieval-Augmented Generation (RAG) models. Include the conversational context in retrieval queries, and keep LLM-generated counterspeech factually aligned with verified external knowledge, since ungrounded models perform poorly. The dataset supports building more effective and persuasive online moderation tools.
Key insights
CATCH-ME provides a multilingual, multi-turn, knowledge-grounded dataset for countering intertwined hate speech and misinformation.
Principles
- Counterspeech against hate and misinformation requires factual grounding and empathy.
- LLMs need high-quality, multi-turn examples for effective counterspeech generation.
- Human-machine collaboration enhances dataset quality and scalability.
Method
CATCH-ME was collected over 18 months with 23 experts using four human-machine collaboration strategies (pre-compiled, interactive, manual, translation) to create multi-turn dialogues grounded in fact-checking articles and NGO reports.
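As a rough illustration of what a grounded multi-turn record could look like, the sketch below models a dialogue whose counterspeech turns point to a source document and its chunks. The class names, field names, speaker labels, and language codes are all hypothetical and are not taken from the released dataset; only the five languages and four collection strategies come from the summary above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Languages and collection strategies named in the summary; the short
# codes and string labels themselves are assumptions for illustration.
LANGUAGES = {"en", "it", "mt", "pl", "es"}
STRATEGIES = {"pre-compiled", "interactive", "manual", "translation"}


@dataclass
class Turn:
    speaker: str                        # hypothetical labels: "hostile" or "counterspeech"
    text: str
    document_id: Optional[str] = None   # document-level grounding (assumed field)
    chunk_ids: List[str] = field(default_factory=list)  # chunk-level grounding (assumed field)


@dataclass
class Dialogue:
    dialogue_id: str
    language: str
    strategy: str
    turns: List[Turn]

    def __post_init__(self) -> None:
        # Reject values outside the sets described in the summary.
        if self.language not in LANGUAGES:
            raise ValueError(f"unknown language: {self.language}")
        if self.strategy not in STRATEGIES:
            raise ValueError(f"unknown strategy: {self.strategy}")


def ungrounded_counterspeech(dialogue: Dialogue) -> List[int]:
    """Return indices of counterspeech turns lacking document or chunk grounding."""
    return [
        i
        for i, turn in enumerate(dialogue.turns)
        if turn.speaker == "counterspeech"
        and not (turn.document_id and turn.chunk_ids)
    ]


# Placeholder record with invented identifiers and neutral text.
example = Dialogue(
    dialogue_id="d-0001",
    language="en",
    strategy="manual",
    turns=[
        Turn("hostile", "placeholder hostile message"),
        Turn("counterspeech", "placeholder reply", "doc-7", ["doc-7#2"]),
        Turn("hostile", "placeholder follow-up"),
        Turn("counterspeech", "placeholder reply without sources"),
    ],
)
print(ungrounded_counterspeech(example))
```

A check like `ungrounded_counterspeech` is one way to confirm, before training, that every counterspeech turn in a loaded record actually carries its grounding references.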
In practice
- Use CATCH-ME to train RAG systems for fact-based counterspeech.
- Evaluate retrieval models with conversational context queries ($Q_{DC}$).
- Ground LLM-generated counterspeech with verified external knowledge.
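The retrieval advice above can be sketched with a toy comparison between a query built from the last turn only and one built from the whole dialogue. This is a minimal sketch, assuming that $Q_{DC}$ denotes a query that includes the dialogue context (the summary does not define it). The bag-of-words cosine scorer is a stand-in and not the paper's retriever, and the chunks, turns, and helper names are invented for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_query(turns: List[str], use_context: bool) -> str:
    """Join all turns (context query) or keep only the last turn."""
    return " ".join(turns) if use_context else turns[-1]


def rank_chunks(query: str, chunks: Dict[str, str]) -> List[str]:
    """Rank chunk ids by similarity to the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), cid) for cid, text in chunks.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cid for _, cid in scored]


def recall_at_k(ranked: List[str], gold: List[str], k: int) -> float:
    """Fraction of gold chunk ids found in the top k results."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & set(gold)) / len(gold)


# Invented toy data: c1 is the relevant chunk, c2 is a distractor.
chunks = {
    "c1": "records show public housing in the city is allocated by waiting list",
    "c2": "why early birds get worms first",
}
turns = [
    "they get all the public housing in this city",
    "the records do not show that",
    "then why do they get it first",
]
gold = ["c1"]

for use_context in (False, True):
    ranked = rank_chunks(build_query(turns, use_context), chunks)
    print(use_context, ranked, recall_at_k(ranked, gold, k=1))
```

In this toy case the last turn alone shares no vocabulary with the relevant chunk, so only the context query retrieves it first. With a real retriever the effect would have to be measured on the dataset's own annotations.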
Topics
- Counterspeech
- Hate Speech
- Misinformation
- RAG Systems
- Multilingual Datasets
- LLM Generation
- Expert Annotation
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.