When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains

2026-04-21 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

Researchers have introduced HarmThoughts, a benchmark for step-wise safety evaluation of reasoning traces in large reasoning models (LRMs). It addresses a gap in current safety evaluations, which typically examine only final outputs and miss harm that emerges during multi-step reasoning. HarmThoughts is built on a taxonomy of 16 harmful reasoning behaviors in four functional groups, which describe how harm propagates rather than just where it ends up. The dataset comprises 56,931 sentences from 1,018 reasoning traces generated by four model families, with each sentence annotated with a fine-grained behavioral label. Analysis with HarmThoughts reveals common behavioral trajectories and "drift points" where reasoning transitions from safe to unsafe. Initial evaluations show that existing white-box and black-box detectors struggle with this fine-grained behavior detection, especially for subtle categories related to harm emergence and execution.

Key takeaway

If you develop or deploy large reasoning models, you need to understand how harmful behaviors propagate within reasoning chains. Integrate process-level safety monitoring to identify "drift points", where reasoning transitions from safe to unsafe, rather than relying solely on final-output evaluation. This can support more reliable safety interventions and systematic failure diagnosis in your models.

Key insights

HarmThoughts benchmarks step-wise harmful behavior in LRM reasoning traces, showing how harm propagates through the reasoning itself rather than appearing only in final outputs.

Principles

Harm emerges through distinct behavioral steps.
Safety evaluation needs sentence-level granularity.
Harm propagation patterns are identifiable.

Method

HarmThoughts uses a 16-behavior harm taxonomy across four functional groups to annotate 56,931 sentences from 1,018 LRM reasoning traces, enabling fine-grained, step-wise safety evaluation.
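To make the annotation scheme concrete, here is a minimal Python sketch of how sentence-level labels might be represented and regrouped into traces. The record layout, the field names, and the `group_by_trace` helper are assumptions for illustration only; the summary does not describe the dataset's actual file format or name the 16 behaviors, so the label fields hold placeholder strings.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class SentenceLabel:
    """One annotated sentence (hypothetical layout, not the released schema)."""
    trace_id: str
    position: int            # 0-based index of the sentence within its trace
    text: str
    behavior: Optional[str]  # placeholder behavior name, or None if benign
    group: Optional[str]     # placeholder functional group, or None if benign


def group_by_trace(rows: Iterable[SentenceLabel]) -> Dict[str, List[SentenceLabel]]:
    """Collect sentences per trace and order them by position."""
    traces: Dict[str, List[SentenceLabel]] = defaultdict(list)
    for row in rows:
        traces[row.trace_id].append(row)
    for sentences in traces.values():
        sentences.sort(key=lambda r: r.position)
    return dict(traces)


# Dataset scale as reported in the summary: 56,931 sentences over 1,018 traces.
AVG_SENTENCES_PER_TRACE = 56_931 / 1_018  # roughly 56 sentences per trace
```

At roughly 56 labeled sentences per trace, each trace is long enough to study as an ordered sequence of behaviors rather than as a single safe or unsafe verdict.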

In practice

Analyze reasoning traces for drift points.
Develop detectors for nuanced harm categories.
Monitor models for refusal suppression.
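The first of these steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's procedure: it assumes each sentence in a trace has already been reduced to a boolean (unsafe or not), and it defines a drift point as any sentence flagged unsafe that directly follows a sentence flagged safe. The function names are hypothetical.

```python
from typing import List, Optional, Sequence


def find_drift_points(is_unsafe: Sequence[bool]) -> List[int]:
    """Return indices where an unsafe sentence directly follows a safe one.

    A trace that is unsafe from its first sentence has no safe-to-unsafe
    transition, so index 0 is never reported.
    """
    return [
        i
        for i in range(1, len(is_unsafe))
        if is_unsafe[i] and not is_unsafe[i - 1]
    ]


def first_drift_point(is_unsafe: Sequence[bool]) -> Optional[int]:
    """Return the earliest drift point, or None if the trace never drifts."""
    points = find_drift_points(is_unsafe)
    return points[0] if points else None


# Toy per-sentence flags for one trace (made-up values, not benchmark data).
flags = [False, False, True, True, False, True]
print(find_drift_points(flags))   # [2, 5]
print(first_drift_point(flags))   # 2
```

In a monitoring setup, the earliest drift point is a natural place to intervene, since everything before it was judged safe. How the per-sentence flags are produced is left open here; the summary notes that existing detectors still struggle with exactly that step.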

Topics

Large Reasoning Models
AI Safety Benchmarking
HarmThoughts Dataset
Harm Taxonomy
Reasoning Chain Analysis

Best for: Research Scientist, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.