From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

2025-07-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Data Science & Analytics · Depth: Expert, medium

Summary

A Microsoft Responsible AI study introduces a paired, transition-based analysis for evaluating Large Language Model (LLM) safety, moving beyond binary outcomes to assess how risk changes between user prompts and model responses. Analyzing 1,250 human-labeled prompt-response records across four harm categories (Hate, Sexual, Violence, Self-harm) and ordinal severity levels, the research found that 61% of responses de-escalate harm, 36% preserve severity, and 3% escalate it. Sexual content proved three times harder to de-escalate than Hate or Violence, primarily due to persistence on already-sexual prompts. Joint analysis with response relevance revealed that all compliance-escalation cases from non-zero prompts were highly relevant (relevance-3), while medium-severity responses showed the lowest relevance (64%), often due to tangential elaborations in Violence and Sexual categories.

Key takeaway

For AI safety researchers and LLM developers designing moderation systems, this analysis highlights the need to move beyond simple output filtering. Your focus should shift to understanding prompt-to-response risk transitions and how they interact with response relevance. Specifically, prioritize targeted de-escalation mechanisms for persistent harm categories such as Sexual content, and implement robust response-side moderation, since 80% of escalations originate from benign (severity-0) prompts.

Key insights

LLM safety analysis benefits from paired prompt-to-response severity transitions and joint relevance assessment.

Principles

Safety evaluation should track risk transitions, not just endpoint outcomes.
Harm categories exhibit asymmetric de-escalation difficulty.
Helpfulness-harmlessness tradeoff manifests as high-relevance escalations.

Method

A paired, transition-based analysis using independently human-labeled prompt and response severity (0-3) across four harm categories, combined with response relevance scoring (1-3), to track risk changes.
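The transition idea can be illustrated with a minimal Python sketch. It assumes each record carries a prompt severity and a response severity on the 0-3 scale and labels the pair by comparing the two. The `Record` type, the function names, and the toy records are hypothetical illustrations, not the study's code or data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One human-labeled prompt-response pair (hypothetical schema)."""
    category: str           # "Hate", "Sexual", "Violence", or "Self-harm"
    prompt_severity: int    # ordinal severity of the prompt, 0-3
    response_severity: int  # ordinal severity of the response, 0-3
    relevance: int          # response relevance, 1-3


def transition(record: Record) -> str:
    """Label a pair by how severity changes from prompt to response."""
    if record.response_severity < record.prompt_severity:
        return "de-escalate"
    if record.response_severity > record.prompt_severity:
        return "escalate"
    return "preserve"


def transition_shares(records: list[Record]) -> dict[str, float]:
    """Return the fraction of records in each transition class."""
    counts = Counter(transition(r) for r in records)
    total = sum(counts.values())
    labels = ("de-escalate", "preserve", "escalate")
    if total == 0:
        return {label: 0.0 for label in labels}
    return {label: counts[label] / total for label in labels}


# Toy records for illustration only; these are NOT values from the study.
records = [
    Record("Hate", 2, 0, 3),
    Record("Sexual", 2, 2, 3),
    Record("Violence", 0, 1, 3),
    Record("Self-harm", 3, 1, 2),
]
shares = transition_shares(records)
```

Grouping the same records by `category` before calling `transition_shares` would give per-category shares, which is one way to surface the asymmetry between harm categories described in the summary.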

In practice

Implement response-side moderation for severity-0 prompt escalations.
Prioritize de-escalation strategies for Sexual content.
Monitor medium-severity responses for tangential elaborations.
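The three recommendations above could be turned into review flags along the following lines. This is a sketch under stated assumptions: "medium severity" is taken to mean severity 2 on the 0-3 scale, a relevance score below 3 is treated as a possible tangent, and the function and flag names are hypothetical. None of these thresholds come from the study.

```python
def review_flags(prompt_severity: int, response_severity: int,
                 relevance: int, category: str) -> list[str]:
    """Return review flags for one labeled prompt-response pair."""
    flags = []

    # Response-side moderation: a severity-0 prompt produced a harmful response.
    if prompt_severity == 0 and response_severity > 0:
        flags.append("escalation-from-benign-prompt")

    # Sexual content: severity persisted or rose on an already-sexual prompt.
    if (category == "Sexual" and prompt_severity > 0
            and response_severity >= prompt_severity):
        flags.append("sexual-severity-not-reduced")

    # Assumption: medium severity == 2; relevance below 3 may signal a tangent.
    if response_severity == 2 and relevance < 3:
        flags.append("possible-tangential-elaboration")

    return flags


# Hypothetical pair: a severity-0 Violence prompt answered at severity 1.
example = review_flags(0, 1, 3, "Violence")
```

Flags like these are meant to route records to closer review rather than to block responses automatically; the thresholds would need tuning against real labeled data.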

Topics

LLM Safety Evaluation
Prompt-Response Analysis
Harm De-escalation
Content Moderation Taxonomies
Helpfulness-Harmlessness Tradeoff

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.