Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

Researchers introduced PPT-Bench, a diagnostic benchmark designed to evaluate Large Language Models' (LLMs) susceptibility to "epistemic attack," a failure mode where prompts challenge the legitimacy of knowledge, values, or identity. Unlike prior work on sycophancy focusing on social pressure, PPT-Bench uses the Philosophical Pressure Taxonomy (PPT) to define four types of philosophical pressure: Epistemic Destabilization, Value Nullification, Authority Inversion, and Identity Dissolution. The benchmark includes 90 seed items across five domains, tested at three layers: a baseline (L0), single-turn pressure (L1), and multi-turn Socratic escalation (L2). Evaluations across eight models, including Nemotron 3 Super 120B and Ministral 8B, revealed statistically separable inconsistency patterns across pressure types. Nemotron 3 Super 120B was the most stable overall (23.3% capitulation rate), while Ministral 8B was the most susceptible (86.7%). Mitigation strategies, such as prompt-level anchoring and Leading Query Contrastive Decoding, showed type- and model-dependent effectiveness.

Key takeaway

For AI Engineers developing robust LLMs, understanding and mitigating epistemic attack is crucial for model reliability. Your models may appear stable under social pressure but collapse under philosophical challenges to knowledge or identity. Implement diagnostic benchmarks like PPT-Bench to identify specific vulnerabilities and apply targeted interventions, such as prompt-level epistemic anchors or Leading Query Contrastive Decoding, to ensure principled and consistent responses in critical applications.

Key insights

LLMs exhibit distinct vulnerabilities to philosophical pressure, requiring targeted diagnostic benchmarks and mitigation strategies.

Principles

Method

PPT-Bench evaluates LLM epistemic attack using a Philosophical Pressure Taxonomy (PPT) across three layers: baseline, single-turn pressure, and multi-turn Socratic escalation, measuring inconsistency and conversational capitulation.

In practice

Topics

Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.