From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Data Science & Analytics · Depth: Expert, medium

Summary

A Microsoft Responsible AI study introduces a paired, transition-based analysis for evaluating Large Language Model (LLM) safety. Rather than scoring outputs as simply safe or unsafe, it measures how risk changes between the user prompt and the model response. Across 1,250 human-labeled prompt-response records, covering four harm categories (Hate, Sexual, Violence, Self-harm) and ordinal severity levels, 61% of responses de-escalate harm, 36% preserve severity, and 3% escalate it. Sexual content proved three times harder to de-escalate than Hate or Violence, primarily because severity persists on already-sexual prompts. Joint analysis with response relevance showed that all compliance-escalation cases from non-zero prompts were highly relevant (relevance-3), while medium-severity responses showed the lowest relevance (64%), often due to tangential elaborations in the Violence and Sexual categories.

Key takeaway

For AI safety researchers and LLM developers designing moderation systems, this analysis shows that simple output filtering is not enough. Focus instead on prompt-to-response risk transitions and how they interact with relevance. Specifically, prioritize targeted de-escalation mechanisms for persistent harm categories such as Sexual content, and implement robust response-side moderation, because 80% of escalations originate from benign prompts.

Key insights

LLM safety analysis benefits from paired prompt-to-response severity transitions and joint relevance assessment.

Principles

Method

A paired, transition-based analysis using independently human-labeled prompt and response severity (0-3) across four harm categories, combined with response relevance scoring (1-3), to track risk changes.
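
The transition bookkeeping described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the study's code: the `Record` type, its field names, the three transition labels, and the toy data are all assumptions made for the example. Here a transition is classified by comparing the response severity against the prompt severity for the same record, and relevance is tallied jointly with the transition label.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One labeled prompt-response pair (hypothetical schema)."""
    category: str           # e.g. "Hate", "Sexual", "Violence", "Self-harm"
    prompt_severity: int    # ordinal 0-3, labeled on the prompt
    response_severity: int  # ordinal 0-3, labeled independently on the response
    relevance: int          # 1-3, how relevant the response is to the prompt


def transition(record: Record) -> str:
    """Classify the prompt-to-response severity change for one record."""
    if record.response_severity < record.prompt_severity:
        return "de-escalate"
    if record.response_severity > record.prompt_severity:
        return "escalate"
    return "preserve"


def transition_shares(records: list[Record]) -> dict[str, float]:
    """Fraction of records falling into each transition class."""
    if not records:
        return {"de-escalate": 0.0, "preserve": 0.0, "escalate": 0.0}
    counts = Counter(transition(r) for r in records)
    return {
        label: counts[label] / len(records)
        for label in ("de-escalate", "preserve", "escalate")
    }


def transition_by_relevance(records: list[Record]) -> Counter:
    """Joint tally of (transition label, relevance score) pairs."""
    return Counter((transition(r), r.relevance) for r in records)


# Toy data, invented purely to exercise the functions.
toy_records = [
    Record("Hate", 2, 0, 1),       # de-escalate
    Record("Sexual", 2, 2, 3),     # preserve
    Record("Violence", 0, 1, 3),   # escalate from a benign prompt
    Record("Self-harm", 3, 1, 2),  # de-escalate
]

shares = transition_shares(toy_records)
joint = transition_by_relevance(toy_records)
```

Keeping the prompt and response labels as separate fields on the same record is what makes the analysis paired: an aggregate response-side rate alone could not distinguish a severity-2 response to a severity-3 prompt (de-escalation) from the same response to a benign prompt (escalation).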

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.