Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

A new reinforcement learning-based post-training method enhances thinking-based multimodal large language models (MLLMs) for explainable detection of hateful and propagandistic memes. The approach uses task-specific rewards and Group Relative Policy Optimization (GRPO) to improve both classification performance and reference-based explanation quality. The researchers first ran an empirical study of off-the-shelf MLLMs on English and Arabic benchmarks, then extended existing meme datasets with weakly supervised Chain-of-Thought (CoT) rationales and multi-LLM propaganda annotations. The method introduces a GRPO-based objective with thinking-length regularization that jointly optimizes accuracy and explanation quality, and the study also explores self-supervised GRPO on unlabeled memes. On the Hateful Memes (FHM) benchmark, accuracy improves by up to 2.1 percentage points (from 79.9% to 82.0%); on ArMeme, macro-F1 improves by up to 7.6 points (from 0.536 to 0.612). The system also generates natural-language explanations and delivers more balanced per-class performance on ArMeme than sequence-classification baselines.

Key takeaway

For machine learning engineers building content moderation systems, this research indicates that combining reinforcement learning with Chain-of-Thought supervision can improve both MLLM accuracy and explanation quality for hateful and propagandistic meme detection. Consider adapting GRPO-based post-training to fine-tune your multimodal models, especially when explainability is critical. The reported per-class performance on ArMeme is also more balanced than that of sequence-classification baselines, which matters for detection across diverse kinds of harmful content.

Key insights

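Reinforcement learning with CoT supervision improves both the classification performance and the explanation quality of thinking-based MLLMs on hateful and propagandistic meme detection, with reported gains of up to 2.1 accuracy points on Hateful Memes and 7.6 macro-F1 points on ArMeme.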

Principles

Method

A GRPO-based objective with thinking-length regularization jointly optimizes MLLM classification accuracy and explanation quality, using task-specific rewards and weakly supervised CoT rationales. Self-supervised GRPO with pseudo-labels is also explored.
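To make the shape of such an objective concrete, here is a minimal sketch of how a combined reward and GRPO's group-relative advantages could be computed for one meme. It is not the authors' implementation. The reward weights, the linear length penalty, the token-overlap F1 used as a stand-in for a reference-based explanation metric, and the majority-vote pseudo-label are all assumptions made for illustration, as are the function names. Only the group normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
from collections import Counter
from statistics import mean, pstdev


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1: an assumed stand-in for a reference-based explanation metric."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def length_penalty(n_tokens: int, budget: int) -> float:
    """Hypothetical regularizer: 0 within the budget, rising linearly to a cap of 1."""
    return min(max(n_tokens - budget, 0) / budget, 1.0)


def reward(sample, gold_label, reference,
           w_acc=1.0, w_exp=0.5, w_len=0.2, budget=64):
    """Combine label correctness, explanation quality, and thinking length.

    The weights and the budget are illustrative values, not taken from the paper.
    """
    acc = 1.0 if sample["label"] == gold_label else 0.0
    exp = token_f1(sample["explanation"], reference)
    pen = length_penalty(len(sample["thinking"].split()), budget)
    return w_acc * acc + w_exp * exp - w_len * pen


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def majority_pseudo_label(samples):
    """Assumed pseudo-labeling rule for unlabeled memes: most frequent sampled label."""
    return Counter(s["label"] for s in samples).most_common(1)[0][0]


if __name__ == "__main__":
    # One group of sampled completions for a single meme (toy data).
    group = [
        {"label": "hateful", "thinking": "short thought",
         "explanation": "the text targets a group"},
        {"label": "not-hateful", "thinking": "short thought",
         "explanation": "it is a harmless joke"},
        {"label": "hateful", "thinking": "word " * 200,
         "explanation": "the text targets a group"},
    ]
    rewards = [reward(s, "hateful", "the text targets a group") for s in group]
    print(rewards)
    print(group_relative_advantages(rewards))
    print(majority_pseudo_label(group))
```

In this sketch, the third completion has the correct label and explanation but an overlong thinking trace, so its reward and advantage fall below those of the first completion. That is the intended effect of a length regularizer. In the self-supervised setting, the pseudo-label would take the place of `gold_label`, and the explanation term would need a different source of supervision.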

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.