Generalised Eigenvalue Geometry of Semantic Adversarial Attacks
Summary
This paper introduces a theoretical framework for semantic adversarial attacks, focusing on a two-model threat model involving proxy and target embeddings. It defines a "local attackability index" (λ*(x)) as the top generalized eigenvalue of a matrix pencil formed by the Jacobians of the two embedders. This index quantifies the worst-case local displacement of the target representation under a proxy budget. The research derives a closed-form prediction-flip condition for affine readouts and develops population-level attackability measures with uniform concentration bounds using VC dimension and fat-shattering margin theory. The study also bridges this continuous theory to discrete paraphrase searches. Empirical verification using FinBERT and Sentence-BERT on the Financial PhraseBank confirms the theoretical inequality Σ_w(x) ≤ λ*(x) and shows the attackability-adjusted margin Z_w(x) predicts vulnerability with an AUC of approximately 0.91.
Key takeaway
AI security engineers evaluating NLP model robustness can diagnose semantic attackability with the local attackability index λ*(x) and the attackability-adjusted margin Z_w(x), both derived from embedding Jacobians. In the reported experiments, Z_w(x) predicts vulnerability with an AUC of approximately 0.91, giving a fuller assessment than finite-search methods alone. Prioritize Z_w(x) when assessing a specific readout, as λ*(x) can be overly conservative.
Key insights
Semantic adversarial vulnerability is governed by the relative local geometry of proxy and target embedding models.
Principles
- Attackability is quantified by a generalized eigenvalue of embedding Jacobians.
- A successful finite search certifies vulnerability; a failed search does not by itself certify robustness.
- Local robustness requires a sufficiently large attackability-adjusted margin.
Method
A continuous local model of paraphrase perturbations uses Jacobians of proxy and target embedders to form a matrix pencil, whose top generalized eigenvalue defines the attackability index.
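As a rough illustration of the method above, the sketch below computes a top generalized eigenvalue from two Jacobian matrices with NumPy. It assumes the pencil is (J_t^T J_t, J_p^T J_p), so that the eigenvalue equals the largest value of ||J_t δ||² over perturbations δ with ||J_p δ||² ≤ 1. The paper's exact construction and normalization (for instance, whether λ*(x) is this value or its square root) may differ. The function name `attackability_index`, the `ridge` regularizer, and the random stand-in Jacobians are assumptions made for this example and do not come from the source.

```python
import numpy as np


def attackability_index(J_target, J_proxy, ridge=1e-8):
    """Top generalized eigenvalue of the pencil (A, B), with
    A = J_target^T J_target and B = J_proxy^T J_proxy + ridge * I.

    Assumed reading of the summary: the largest squared target displacement
    achievable per unit of squared proxy displacement. The ridge term is an
    illustrative safeguard that keeps B positive definite.
    """
    A = J_target.T @ J_target
    B = J_proxy.T @ J_proxy + ridge * np.eye(J_proxy.shape[1])
    # Whiten with the Cholesky factor B = L L^T: the generalized problem
    # A v = lam * B v becomes an ordinary symmetric one for L^{-1} A L^{-T}.
    L = np.linalg.cholesky(B)
    X = np.linalg.solve(L, A)      # L^{-1} A
    M = np.linalg.solve(L, X.T)    # L^{-1} A L^{-T}, using the symmetry of A
    M = 0.5 * (M + M.T)            # remove round-off asymmetry
    # eigvalsh returns eigenvalues in ascending order; take the largest.
    return float(np.linalg.eigvalsh(M)[-1])


# Stand-in Jacobians (random, NOT from real embedders): 4 input directions,
# a 6-dimensional target embedding and a 5-dimensional proxy embedding.
rng = np.random.default_rng(0)
J_t = rng.normal(size=(6, 4))
J_p = rng.normal(size=(5, 4))
lam = attackability_index(J_t, J_p)

# Any perturbation on the unit proxy budget should stay at or below lam.
delta = rng.normal(size=4)
delta /= np.linalg.norm(J_p @ delta)
print("index:", lam, "sampled displacement^2:", np.linalg.norm(J_t @ delta) ** 2)
```

With real models, the two Jacobians would have to be taken with respect to a shared continuous input parameterization, which this sketch does not address. If SciPy is available, `scipy.linalg.eigh(A, B)` solves the same symmetric generalized problem directly.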
In practice
- Use λ*(x) to diagnose model fragility to semantic attacks.
- Evaluate Z_w(x) as a reliable screening index for vulnerability.
- Report finite search failures with search procedure details for context.
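To check how well a screening index such as Z_w(x) separates vulnerable from non-vulnerable inputs, one option is to compute an AUC against the outcomes of a paraphrase search. The sketch below is a generic rank-based AUC, not the paper's evaluation code. It assumes that a smaller adjusted margin signals higher vulnerability, so the negated margin is used as the score. The margin values and attack outcomes are made up for illustration.

```python
import numpy as np


def auc(scores, labels):
    """Probability that a randomly chosen positive example receives a higher
    score than a randomly chosen negative one (ties count as one half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative example")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


# Hypothetical per-example adjusted margins and search outcomes
# (1 = the search found a paraphrase that flipped the prediction).
z_margin = np.array([0.2, 1.5, 0.4, 2.1, 0.9, 0.1])
flipped = np.array([1, 0, 1, 0, 0, 1])

# Assumption: small margin means vulnerable, so rank by the negated margin.
print("screening AUC:", auc(-z_margin, flipped))
```

When reporting such a number, it helps to state the search procedure and budget alongside it, since the labels only record what that particular search managed to find.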
Topics
- Semantic Adversarial Attacks
- Adversarial Robustness Theory
- Generalized Eigenvalue Problem
- NLP Sentiment Classification
- FinBERT
- Sentence-BERT
Best for: Research Scientist, AI Scientist, NLP Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.