CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, extended

Summary

CATCH-ME ("Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges") is introduced as the first large-scale, expert-curated, multilingual dataset addressing the intersection of online hate speech and misinformation. Developed with 23 domain experts over 18 months, it comprises 2,015 dialogues and 12,298 turns; half of those turns (6,149) are counterspeech grounded in external knowledge from fact-checking articles and NGO reports. The dataset covers five languages (English, Italian, Maltese, Polish, and Spanish) and targets hostility towards seven marginalized groups, including Muslims, Jewish people, and LGBTQIA+ individuals. Its document- and chunk-level annotations make it directly usable for training and evaluating Retrieval-Augmented Generation (RAG) systems that produce more persuasive, factually grounded counterspeech. The paper also establishes initial benchmarks for retrieval and generation on the corpus.

Key takeaway

For NLP engineers building systems to counter online hate and misinformation, CATCH-ME is a multilingual, multi-turn, knowledge-grounded dataset for training and evaluating Retrieval-Augmented Generation (RAG) models. Include the conversational context in retrieval queries, and check that LLM-generated counterspeech stays factually aligned with verified external knowledge, since ungrounded models perform poorly. The dataset supports building more effective and persuasive online moderation tools.
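To illustrate why conversational context matters in the retrieval query, here is a minimal sketch. It is not the paper's retriever: it swaps in a plain bag-of-words cosine similarity, and the function names (`build_query`, `retrieve`), the chunk ids, and the toy texts are all hypothetical placeholders, not CATCH-ME data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_query(history, max_turns=3):
    """Join the last `max_turns` turns so the query carries dialogue context."""
    return " ".join(history[-max_turns:])


def retrieve(history, chunks, k=1, max_turns=3):
    """Return the ids of the top-k chunks for the context-aware query."""
    query = Counter(tokenize(build_query(history, max_turns)))
    ranked = sorted(
        chunks.items(),
        key=lambda item: cosine(query, Counter(tokenize(item[1]))),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Toy, invented dialogue and knowledge chunks (assumptions for illustration only).
history = [
    "Group X members get free housing from the city.",
    "Housing is allocated by a public points system open to every resident.",
    "They still jump the queue.",
]
chunks = {
    "doc1#0": "The city housing points system ranks every resident by need and waiting time.",
    "doc2#0": "Queue management at the airport was updated last year.",
}

last_turn_only = retrieve(history, chunks, k=1, max_turns=1)
with_context = retrieve(history, chunks, k=1, max_turns=3)
```

In this toy setup the last turn alone is too elliptical to reach the housing chunk, while the three-turn query does reach it. A real system would likely replace the lexical scorer with a dense or hybrid retriever; the query-construction step is the part this sketch is meant to show.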

Key insights

CATCH-ME provides a multilingual, multi-turn, knowledge-grounded dataset for countering intertwined hate speech and misinformation.

Method

CATCH-ME was collected over 18 months with 23 experts, using four human-machine collaboration strategies (pre-compiled, interactive, manual, and translation) to create multi-turn dialogues grounded in fact-checking articles and NGO reports.
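One way to picture a record with these properties is sketched below. The four strategy names come from the description above; everything else (the class names, the field names, the `"hate"` / `"counterspeech"` speaker labels, and the `doc1#0` id format) is a hypothetical assumption and not the dataset's published schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Strategy(Enum):
    """The four collection strategies named in the summary."""
    PRE_COMPILED = "pre-compiled"
    INTERACTIVE = "interactive"
    MANUAL = "manual"
    TRANSLATION = "translation"


@dataclass
class Turn:
    speaker: str                        # hypothetical labels: "hate" or "counterspeech"
    text: str
    document_id: Optional[str] = None   # document-level grounding (assumed field)
    chunk_ids: List[str] = field(default_factory=list)  # chunk-level grounding (assumed field)


@dataclass
class Dialogue:
    dialogue_id: str
    language: str                       # e.g. "en", "it", "mt", "pl", "es"
    strategy: Strategy
    turns: List[Turn] = field(default_factory=list)

    def ungrounded_counterspeech(self):
        """Indices of counterspeech turns missing a document or chunk reference."""
        return [
            i for i, turn in enumerate(self.turns)
            if turn.speaker == "counterspeech"
            and not (turn.document_id and turn.chunk_ids)
        ]


# Placeholder dialogue; the texts are stand-ins, not dataset content.
example = Dialogue(
    dialogue_id="d1",
    language="en",
    strategy=Strategy.MANUAL,
    turns=[
        Turn("hate", "<hostile claim>"),
        Turn("counterspeech", "<grounded reply>", "doc1", ["doc1#0"]),
        Turn("hate", "<follow-up claim>"),
        Turn("counterspeech", "<reply without grounding>"),
    ],
)
missing = example.ungrounded_counterspeech()
```

A check like `ungrounded_counterspeech` is a cheap sanity pass before using records for RAG training or evaluation, since a counterspeech turn without a document and chunk reference cannot serve as a retrieval target.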

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.