Generalised Eigenvalue Geometry of Semantic Adversarial Attacks

2026-06-18 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Data Science & Analytics · Depth: Expert, extended

Summary

This paper introduces a theoretical framework for semantic adversarial attacks, focusing on a two-model threat model involving proxy and target embeddings. It defines a "local attackability index" (λ*(x)) as the top generalized eigenvalue of a matrix pencil formed by the Jacobians of the two embedders. This index quantifies the worst-case local displacement of the target representation under a proxy budget. The research derives a closed-form prediction-flip condition for affine readouts and develops population-level attackability measures with uniform concentration bounds using VC dimension and fat-shattering margin theory. The study also bridges this continuous theory to discrete paraphrase searches. Empirical verification using FinBERT and Sentence-BERT on the Financial PhraseBank confirms the theoretical inequality Σ_w(x) ≤ λ*(x) and shows the attackability-adjusted margin Z_w(x) predicts vulnerability with an AUC of approximately 0.91.

Key takeaway

AI Security Engineers evaluating NLP model robustness can diagnose semantic attackability with the proposed local attackability index λ*(x) and the attackability-adjusted margin Z_w(x), both derived from embedding Jacobians. In the reported experiments, Z_w(x) predicts vulnerability with an AUC of approximately 0.91, which complements finite paraphrase search: a failed search on its own does not establish robustness. Prioritize Z_w(x) when assessing a specific readout, as λ*(x) is a worst-case bound over all directions and can be overly conservative.

Key insights

Semantic adversarial vulnerability is governed by the relative local geometry of proxy and target embedding models.

Principles

Attackability is quantified by a generalized eigenvalue of embedding Jacobians.
Finite search success unambiguously certifies model vulnerability.
Local robustness requires a sufficiently large attackability-adjusted margin.
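
As a sketch of why such a margin matters, assume an affine readout f(x) = wᵀφ_t(x) + b on the target embedding φ_t with unit-norm w, Jacobians J_p (proxy) and J_t (target) at x, and a proxy budget ‖J_p δ‖ ≤ ε with J_pᵀJ_p invertible. These definitions are assumptions made for illustration; the paper's exact normalization of Σ_w(x) and Z_w(x) may differ.

```latex
\begin{aligned}
\Sigma_w(x) &= \max_{\delta:\ \|J_p\delta\|_2 \le 1} (w^\top J_t \delta)^2
  = w^\top J_t (J_p^\top J_p)^{-1} J_t^\top w, \\
\Sigma_w(x) &\le \max_{\delta:\ \|J_p\delta\|_2 \le 1} \|J_t\delta\|_2^2 = \lambda^*(x)
  \qquad (\|w\|_2 = 1), \\
|f(x+\delta) - f(x)| &\approx |w^\top J_t\delta|
  \le \varepsilon\sqrt{\Sigma_w(x)} \le \varepsilon\sqrt{\lambda^*(x)} .
\end{aligned}
```

Under this linearization the predicted sign cannot flip while |f(x)| > ε√Σ_w(x). The looser bound through λ*(x) covers every readout direction at once, which is one way to see why it can be conservative for a particular w.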

Method

A continuous local model of paraphrase perturbations uses Jacobians of proxy and target embedders to form a matrix pencil, whose top generalized eigenvalue defines the attackability index.
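
The construction above can be sketched numerically. The example below is a minimal illustration, not the paper's implementation: it assumes the pencil is (J_tᵀJ_t, J_pᵀJ_p), adds a small ridge term so the proxy matrix is positive definite, and uses random matrices in place of real embedder Jacobians. The function names `attackability_index` and `readout_sensitivity` are hypothetical.

```python
import numpy as np
from scipy.linalg import eigh


def attackability_index(J_target, J_proxy, ridge=1e-8):
    """Top generalized eigenvalue of the assumed pencil (Jt^T Jt, Jp^T Jp + ridge*I).

    Returns the eigenvalue and its eigenvector, i.e. the input direction that
    moves the target embedding most per unit of proxy displacement.
    """
    A = J_target.T @ J_target
    B = J_proxy.T @ J_proxy + ridge * np.eye(J_proxy.shape[1])
    # eigh(A, B) solves A v = lambda B v and returns eigenvalues in ascending order.
    eigvals, eigvecs = eigh(A, B)
    return eigvals[-1], eigvecs[:, -1]


def readout_sensitivity(w, J_target, J_proxy, ridge=1e-8):
    """Worst-case squared change along a unit readout direction w (assumed definition)."""
    w = w / np.linalg.norm(w)
    B = J_proxy.T @ J_proxy + ridge * np.eye(J_proxy.shape[1])
    g = J_target.T @ w
    # max (w^T Jt d)^2 subject to d^T B d <= 1 equals g^T B^{-1} g.
    return float(g @ np.linalg.solve(B, g))


# Synthetic stand-ins for the Jacobians at one input x (illustrative only).
rng = np.random.default_rng(0)
d_in, d_proxy, d_target = 6, 8, 5
J_p = rng.normal(size=(d_proxy, d_in))
J_t = rng.normal(size=(d_target, d_in))
w = rng.normal(size=d_target)

lam, direction = attackability_index(J_t, J_p)
sigma_w = readout_sensitivity(w, J_t, J_p)
print(f"lambda* = {lam:.4f}, readout sensitivity = {sigma_w:.4f}")
```

With real models, the Jacobians would come from automatic differentiation of each embedder with respect to a shared continuous input representation; how that shared representation is chosen is left open here. Under the stated assumptions the readout-specific value never exceeds the top eigenvalue, matching the inequality reported in the summary.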

In practice

Use λ*(x) to diagnose model fragility to semantic attacks.
Evaluate Z_w(x) as a reliable screening index for vulnerability.
Report finite search failures with search procedure details for context.

Topics

Semantic Adversarial Attacks
Adversarial Robustness Theory
Generalized Eigenvalue Problem
NLP Sentiment Classification
FinBERT
Sentence-BERT

Best for: Research Scientist, AI Scientist, NLP Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.