Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

2026-02-26 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Data Science & Analytics · Depth: Expert, extended

Summary

The SemSIEdit framework introduces an inference-time, dual-agent (Evaluator-Editor) architecture designed to mitigate Semantic Sensitive Information (SemSI) leakage in Large Language Models (LLMs) without resorting to simple refusals. Unlike traditional defenses, which target structured PII, SemSIEdit addresses context-dependent sensitive inferences, reputation-harming content, and incorrect hazardous information. The framework iteratively critiques and rewrites sensitive spans, achieving a 34.6% reduction in semantic leakage across 13 state-of-the-art LLMs, including GPT-5, at a cost of only 9.8% utility loss. The research reveals a "Scale-Dependent Safety Divergence": larger models such as GPT-5 achieve safety through constructive expansion and added nuance, while smaller models resort to destructive truncation. It also identifies a "Reasoning Paradox": reasoning increases baseline risk, yet it also enables SemSIEdit to execute safe rewrites and restores epistemic calibration by shifting models from confident hallucination to qualified responses.

Key takeaway

If you are a research scientist developing LLM safety mechanisms, consider implementing an agentic rewriting framework like SemSIEdit to address Semantic Sensitive Information. This approach reduces privacy risks and hallucinations while preserving most of the response's utility, especially with larger, more capable models. Your focus should shift from simple content refusal to context-aware content transformation, which improves both safety and user experience in high-stakes applications.

Key insights

Agentic rewriting effectively mitigates semantic sensitive information in LLMs while preserving utility, outperforming refusal-based defenses.

Principles

Semantic privacy requires reasoning, not just pattern matching.
Larger LLMs achieve safety through nuanced rewriting, smaller ones through truncation.
Reasoning can amplify both risk and defense efficacy.

Method

SemSIEdit runs an iterative Evaluator-Editor loop at inference time. An Evaluator agent identifies SemSI in the current response, and an Editor agent rewrites the flagged content while maintaining narrative flow and utility. The loop stops when the Evaluator detects no remaining leakage or when the budget is exhausted.
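The control flow of such a loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Finding`, `semsi_edit_loop`, `toy_evaluate`, and `toy_edit` are hypothetical, the budget is modeled as a maximum number of rounds (the summary does not say what the budget measures), and the two toy functions are keyword-based stand-ins for what would be LLM calls in a real system.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Finding:
    """One sensitive span flagged by the evaluator (hypothetical structure)."""
    span: str    # exact text flagged as sensitive
    reason: str  # why it was flagged


Evaluator = Callable[[str], List[Finding]]
Editor = Callable[[str, List[Finding]], str]


def semsi_edit_loop(
    draft: str,
    evaluate: Evaluator,
    edit: Editor,
    max_rounds: int = 3,  # assumption: the budget is a round count
) -> Tuple[str, int, bool]:
    """Alternate evaluation and editing until clean or out of budget.

    Returns (final_text, edit_rounds_used, clean), where `clean` is True
    when the evaluator reports no remaining findings on the final text.
    """
    text = draft
    for rounds_used in range(max_rounds):
        findings = evaluate(text)
        if not findings:
            return text, rounds_used, True
        text = edit(text, findings)
    # Budget exhausted: run one last check on the final rewrite.
    return text, max_rounds, not evaluate(text)


# Toy stand-ins for the two agents (assumptions for illustration only).
def toy_evaluate(text: str) -> List[Finding]:
    flagged_phrases = ["is probably bankrupt"]
    return [
        Finding(span=phrase, reason="unverified reputational claim")
        for phrase in flagged_phrases
        if phrase in text
    ]


def toy_edit(text: str, findings: List[Finding]) -> str:
    # Rewrite each flagged span instead of dropping the whole answer.
    for finding in findings:
        text = text.replace(
            finding.span, "has not publicly disclosed their finances"
        )
    return text


if __name__ == "__main__":
    final_text, rounds, clean = semsi_edit_loop(
        "The mayor is probably bankrupt.", toy_evaluate, toy_edit
    )
    print(final_text, rounds, clean)
```

Returning the `clean` flag alongside the text lets a caller distinguish a converged rewrite from one that merely ran out of budget, and fall back to a refusal or human review in the second case.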

In practice

Implement dual-agent architectures for nuanced content moderation.
Prioritize larger models for agentic self-correction tasks, since smaller models tend to truncate rather than rewrite.
Focus on rewriting sensitive spans over outright refusal.

Topics

Semantic Sensitive Information
Agentic LLM Defenses
Privacy-Utility Trade-off
LLM Self-Correction
Epistemic Calibration

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.