When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains
Summary
Researchers have introduced HarmThoughts, a new benchmark designed for step-wise safety evaluation of reasoning traces in large reasoning models (LRMs). This benchmark addresses a critical gap in current safety evaluations, which typically focus only on final outputs and overlook the emergence of harm during multi-step reasoning. HarmThoughts is built upon a taxonomy of 16 harmful reasoning behaviors, categorized into four functional groups, which describe how harm propagates rather than just its end result. The dataset comprises 56,931 sentences from 1,018 reasoning traces, generated by four distinct model families, each annotated with fine-grained, sentence-level behavioral labels. Analysis using HarmThoughts reveals common behavioral trajectories and "drift points" where reasoning transitions from safe to unsafe. Initial evaluations show that existing white-box and black-box detectors struggle with this fine-grained behavior detection, especially for subtle categories related to harm emergence and execution.
Key takeaway
For research scientists developing or deploying large reasoning models, it is crucial to understand how harmful behaviors propagate within reasoning chains. Rather than relying solely on final-output evaluations, integrate process-level safety monitoring that identifies "drift points" where reasoning transitions from safe to unsafe. This can support more reliable safety interventions and more systematic failure diagnosis in your models.
Key insights
HarmThoughts benchmarks step-wise harmful behavior in LRM reasoning traces, revealing how harm emerges and propagates during reasoning rather than only in the final output.
Principles
- Harm emerges through distinct behavioral steps.
- Safety evaluation needs sentence-level granularity.
- Harm propagation patterns are identifiable.
Method
HarmThoughts uses a 16-behavior harm taxonomy across four functional groups to annotate 56,931 sentences from 1,018 LRM reasoning traces, enabling fine-grained, step-wise safety evaluation.
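To make the idea of sentence-level labels and drift points concrete, here is a minimal Python sketch. It is not the HarmThoughts data format or the authors' method: the `LabeledSentence` type, the `SAFE` label, and the behavior names `refusal_suppression` and `harm_execution` are hypothetical placeholders, and a drift point is assumed here to be the first unsafe sentence that directly follows a safe one.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical label for sentences that show no harmful behavior.
SAFE = "safe"


@dataclass
class LabeledSentence:
    """One sentence of a reasoning trace with its behavior label (assumed schema)."""
    text: str
    behavior: str  # SAFE or a placeholder harmful-behavior label


def find_drift_point(trace: list[LabeledSentence]) -> Optional[int]:
    """Return the index of the first unsafe sentence that follows a safe one.

    Returns None if the trace never moves from safe to unsafe, including
    traces that are entirely safe or that start out unsafe and stay unsafe.
    """
    for i in range(1, len(trace)):
        if trace[i - 1].behavior == SAFE and trace[i].behavior != SAFE:
            return i
    return None


# Illustrative trace with made-up sentences and placeholder labels.
example_trace = [
    LabeledSentence("The user is asking about a restricted topic.", SAFE),
    LabeledSentence("I should probably decline.", SAFE),
    LabeledSentence("But I could treat it as hypothetical.", "refusal_suppression"),
    LabeledSentence("Step one would be ...", "harm_execution"),
]

drift_index = find_drift_point(example_trace)
```

In this illustrative trace the drift point is the third sentence (index 2), which is where a process-level monitor would have its first opportunity to intervene.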
In practice
- Analyze reasoning traces for drift points.
- Develop detectors for nuanced harm categories.
- Monitor models for refusal suppression.
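When developing detectors for nuanced categories, an aggregate accuracy figure can hide weak performance on rare or subtle labels. The sketch below computes recall per label so each category can be inspected separately. It is an assumption-laden illustration, not the benchmark's evaluation protocol: the function name `per_label_recall` and the label strings are hypothetical, and the gold and predicted labels are made-up data.

```python
from collections import Counter


def per_label_recall(gold: list[str], predicted: list[str]) -> dict[str, float]:
    """Compute recall for each gold label over aligned sentence-level labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # Number of sentences carrying each gold label.
    totals = Counter(gold)
    # Number of sentences where the detector reproduced the gold label.
    hits = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up labels for five sentences; label names are placeholders.
gold_labels = ["safe", "safe", "refusal_suppression", "harm_execution", "harm_execution"]
detector_labels = ["safe", "safe", "safe", "harm_execution", "safe"]

recall_by_label = per_label_recall(gold_labels, detector_labels)
```

Here the hypothetical detector recovers every safe sentence but misses the single `refusal_suppression` sentence and half of the `harm_execution` sentences, the kind of gap that only a per-category breakdown exposes.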
Topics
- Large Reasoning Models
- AI Safety Benchmarking
- HarmThoughts Dataset
- Harm Taxonomy
- Reasoning Chain Analysis
Best for: Research Scientist, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.