Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Data Science & Analytics) · Depth: Expert, extended

Summary

The SemSIEdit framework introduces an inference-time, dual-agent (Evaluator-Editor) architecture designed to mitigate Semantic Sensitive Information (SemSI) leakage in Large Language Models (LLMs) without resorting to simple refusals. Unlike traditional defenses, which target structured PII, SemSIEdit addresses context-dependent sensitive inferences, reputation-harming content, and incorrect hazardous information. The framework iteratively critiques and rewrites sensitive spans, achieving a 34.6% reduction in semantic leakage across 13 state-of-the-art LLMs, including GPT-5, at a cost of only 9.8% in utility. The research reveals a "Scale-Dependent Safety Divergence": larger models such as GPT-5 achieve safety through constructive expansion and nuance, while smaller models resort to destructive truncation. It also identifies a "Reasoning Paradox": reasoning increases baseline risk, yet it also enables SemSIEdit to execute safe rewrites and restore epistemic calibration by shifting models from confident hallucination to qualified responses.

Key takeaway

Research scientists developing LLM safety mechanisms should consider implementing agentic rewriting frameworks such as SemSIEdit to address Semantic Sensitive Information. According to the paper, this approach reduces privacy risks and hallucinations while largely preserving utility, especially with larger, more capable models. The focus should shift from simple content refusal to context-aware content transformation, improving both safety and user experience in high-stakes applications.

Key insights

Agentic rewriting effectively mitigates semantic sensitive information in LLMs while preserving utility, outperforming refusal-based defenses.

Method

SemSIEdit runs an iterative Evaluator-Editor loop at inference time. An Evaluator agent identifies SemSI in the current draft, and an Editor agent rewrites the flagged content while preserving narrative flow and utility. The loop stops when the Evaluator detects no further leakage or when a preset budget is exhausted.
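
The control flow of such a loop can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `Finding`, `self_correct`, `toy_evaluate`, and `toy_edit` are hypothetical, the budget is assumed to be a cap on rewrite rounds, and the two toy functions stand in for what would be LLM-backed agents in practice.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Finding:
    """One flagged span (hypothetical structure, for illustration only)."""
    span: str    # the sensitive text as it appears in the draft
    reason: str  # why the evaluator flagged it


Evaluator = Callable[[str], List[Finding]]
Editor = Callable[[str, List[Finding]], str]


def self_correct(draft: str, evaluate: Evaluator, edit: Editor,
                 max_rounds: int = 3) -> Tuple[str, int]:
    """Alternate evaluation and rewriting until clean or out of budget.

    Returns the final text and the number of rewrite rounds used.
    Assumption: the budget is a fixed number of rounds.
    """
    text = draft
    for rounds_used in range(max_rounds):
        findings = evaluate(text)
        if not findings:
            # Evaluator reports no leakage: stop early.
            return text, rounds_used
        # Editor rewrites the flagged spans instead of refusing outright.
        text = edit(text, findings)
    # Budget exhausted; the last rewrite is returned without re-evaluation.
    return text, max_rounds


# Toy stand-ins: a real system would call LLM agents here.
def toy_evaluate(text: str) -> List[Finding]:
    flagged = ["definitely has a gambling problem"]
    return [Finding(p, "unverified reputational claim")
            for p in flagged if p in text]


def toy_edit(text: str, findings: List[Finding]) -> str:
    for f in findings:
        text = text.replace(f.span, "has been the subject of unverified rumors")
    return text


if __name__ == "__main__":
    final, rounds = self_correct(
        "The mayor definitely has a gambling problem.", toy_evaluate, toy_edit
    )
    print(rounds, final)
```

Passing the evaluator and editor in as plain callables keeps the loop independent of any particular model or prompt, so the same skeleton could wrap different backends when comparing how model scale affects the rewrites.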

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.