Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes
Summary
A reinforcement learning-based post-training method improves thinking-based multimodal large language models (MLLMs) for explainable detection of hateful and propagandistic memes. The approach combines task-specific rewards with Group Relative Policy Optimization (GRPO) and improves both classification performance and reference-based explanation quality. The researchers conducted an empirical study of off-the-shelf MLLMs on English and Arabic benchmarks and extended existing meme datasets with weakly supervised Chain-of-Thought (CoT) rationales and multi-LLM propaganda annotations. The method introduces a GRPO-based objective with thinking-length regularization that jointly optimizes accuracy and explanation quality, and it also explores self-supervised GRPO on unlabeled memes. Experiments on the Hateful Memes (FHM) and ArMeme benchmarks show FHM accuracy gains of up to 2.1 percentage points (from 79.9% to 82.0%) and ArMeme macro-F1 gains of up to 7.6 points (from 0.536 to 0.612). The system also generates natural-language explanations and yields more balanced per-class performance on ArMeme than sequence-classification baselines.
Key takeaway
For machine learning engineers building content moderation systems, this research indicates that combining reinforcement learning with Chain-of-Thought supervision can improve both MLLM accuracy and explanation quality for hateful and propagandistic meme detection, with reported gains of up to 2.1 accuracy points on Hateful Memes and 7.6 macro-F1 points on ArMeme. Consider adapting GRPO-based post-training to fine-tune your multimodal models, especially when explainability is critical. On ArMeme, the approach also yields more balanced per-class performance than sequence-classification baselines, which matters for detection across diverse harmful content.
Key insights
Reinforcement learning with CoT supervision improves MLLM classification performance and explanation quality for hateful and propagandistic meme detection.
Principles
- Memes require multimodal understanding.
- CoT rationales enhance MLLM explainability.
- RL can optimize classification and explanation.
Method
A GRPO-based objective with thinking-length regularization jointly optimizes MLLM classification accuracy and explanation quality, using task-specific rewards and weakly supervised CoT rationales. Self-supervised GRPO with pseudo-labels is also explored.
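To make the objective more concrete, here is a minimal sketch of how a composite reward and GRPO's group-relative advantages could fit together. The within-group standardization of rewards is the standard GRPO step; everything else is an assumption for illustration, not the paper's formulation: the function names, the reward weights, the `target_len` budget, and the linear form of the length penalty are all hypothetical.

```python
from statistics import mean, pstdev


def composite_reward(pred_label, gold_label, explanation_score, n_think_tokens,
                     target_len=256, w_acc=1.0, w_expl=0.5, w_len=0.1):
    """Hypothetical task reward: accuracy + explanation quality - length penalty.

    explanation_score is assumed to be a reference-based similarity in [0, 1]
    computed elsewhere. Weights and target_len are illustrative values only.
    """
    accuracy = 1.0 if pred_label == gold_label else 0.0
    # Penalize only the thinking tokens that exceed the budget (assumed form).
    overshoot = max(0, n_think_tokens - target_len) / target_len
    return w_acc * accuracy + w_expl * explanation_score - w_len * overshoot


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of completions for the same meme."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one meme whose gold label is "hateful".
rewards = [
    composite_reward("hateful", "hateful", 0.8, 200),
    composite_reward("benign", "hateful", 0.8, 200),
    composite_reward("hateful", "hateful", 0.8, 512),  # correct but over budget
    composite_reward("benign", "hateful", 0.8, 200),
]
advantages = group_relative_advantages(rewards)
```

In this sketch, completions that are correct and stay within the length budget receive the largest advantage, correct but overlong ones slightly less, and incorrect ones a negative advantage. The policy update that consumes these advantages is omitted.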
In practice
- Extend meme datasets with CoT rationales.
- Apply GRPO for MLLM fine-tuning.
- Use self-supervised GRPO on unlabeled data.
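The summary does not say how pseudo-labels are produced for the self-supervised variant. One common option, shown here purely as an assumption, is to take the majority label across the sampled completions for an unlabeled meme and reward agreement with it. The function names and the `min_agreement` threshold are hypothetical.

```python
from collections import Counter


def majority_pseudo_label(sampled_labels, min_agreement=0.6):
    """Return the majority label of a sampled group, or None if agreement is weak.

    Assumption: skipping low-agreement groups avoids training on noisy
    pseudo-labels. The 0.6 threshold is an illustrative value.
    """
    if not sampled_labels:
        return None
    label, count = Counter(sampled_labels).most_common(1)[0]
    if count / len(sampled_labels) < min_agreement:
        return None
    return label


def pseudo_rewards(sampled_labels, min_agreement=0.6):
    """Reward 1.0 for completions matching the pseudo-label, else 0.0."""
    pseudo = majority_pseudo_label(sampled_labels, min_agreement)
    if pseudo is None:
        return None  # caller skips this meme for the current update
    return [1.0 if label == pseudo else 0.0 for label in sampled_labels]


# Labels predicted by four sampled completions for one unlabeled meme.
group = ["hateful", "hateful", "benign", "hateful"]
rewards = pseudo_rewards(group)
```

These rewards could then be standardized within the group in the same way as rewards computed from gold labels.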
Topics
- Reinforcement Learning
- Chain-of-Thought
- Multimodal LLMs
- Hateful Meme Detection
- Explainable AI
- Group Relative Policy Optimization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.