Improving Adversarial Robustness of Attribution via Implicit Regularization
Summary
A recent study demonstrates that the adversarial robustness of attributions in deep learning can emerge implicitly from the learning dynamics of standard stochastic gradient descent (SGD). This effect, theoretically motivated by connections between parameter-space and input-space curvature, was validated across various architectures, datasets, and attribution methods, incurring negligible computational overhead. The research further reveals that these robustness gains typically do not transfer to attention-based attribution under softmax normalization, a limitation attributed to inherent entropy constraints and confirmed experimentally. However, replacing softmax attention with kernel-based attention successfully restores these robustness benefits in transformer models. This work identifies learning dynamics as a principled and practical mechanism for achieving robust explainability and uncovers fundamental limitations within softmax-normalized attention-based attribution.
Key takeaway
Machine learning engineers building robust explainable AI systems, particularly with Transformers, should investigate whether standard SGD learning dynamics already provide attribution robustness implicitly, at negligible computational overhead. Softmax normalization in attention mechanisms can inherently limit these gains; consider kernel-based attention instead to restore robustness in transformer models and obtain more reliable explanations.
Key insights
Attribution robustness in deep learning can arise implicitly from SGD learning dynamics, without explicit regularization.
Principles
- Attribution robustness connects to parameter-space and input-space curvature.
- Softmax normalization can limit robustness gains in attention-based attribution.
- Kernel-based attention can restore robustness in Transformers.
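To make the contrast in the last two principles concrete, here is a minimal NumPy sketch of softmax attention next to one common form of kernel-based attention. The summary does not say which kernel the study uses, so the feature map `elu_plus_one` (and the function names) are assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Scaled dot-product attention with softmax-normalized weights."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

def elu_plus_one(x):
    """Positive feature map phi(x) = elu(x) + 1 (assumed kernel choice)."""
    # For x > 0 this is x + 1; for x <= 0 it is exp(x).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def kernel_attention(Q, K, V):
    """Attention whose weights come from a kernel phi(q) . phi(k), not exp(q . k)."""
    phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
    weights = phi_q @ phi_k.T  # shape (n_queries, n_keys), strictly positive
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Usage: 4 queries, 6 keys/values, head dimension 8, value dimension 3.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 3))
out_softmax, w_softmax = softmax_attention(Q, K, V)
out_kernel, w_kernel = kernel_attention(Q, K, V)
```

Both functions return the attention weights alongside the output so the weights can be inspected as attributions. The kernel variant here still normalizes each row to sum to one; the variant evaluated in the study may differ.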
Method
Attribution robustness emerges from standard SGD learning dynamics; the effect was validated across architectures, datasets, and attribution methods.
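One way to quantify attribution robustness is to check how much a gradient-based attribution changes when the input is slightly perturbed. A first-order expansion gives grad f(x + d) - grad f(x) ≈ H d, where H is the input-space Hessian, which is why input-space curvature matters. The sketch below is a hypothetical illustration, not the study's evaluation protocol: the toy model, the function names, and the use of random rather than adversarially optimized perturbations are all assumptions.

```python
import numpy as np

def model(x, W1, w2):
    """Toy scalar model f(x) = w2 . tanh(W1 x) (assumed for illustration)."""
    return float(w2 @ np.tanh(W1 @ x))

def gradient_attribution(x, W1, w2):
    """Analytic input gradient of the toy model, used as the attribution."""
    h = np.tanh(W1 @ x)
    return W1.T @ (w2 * (1.0 - h ** 2))

def cosine_similarity(a, b, eps=1e-12):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def attribution_stability(x, W1, w2, radius=0.05, n_samples=100, seed=0):
    """Lowest cosine similarity between the attribution at x and attributions
    at random points inside an L-infinity ball of the given radius.
    Random sampling only approximates a worst-case (adversarial) search."""
    rng = np.random.default_rng(seed)
    base = gradient_attribution(x, W1, w2)
    worst = 1.0
    for _ in range(n_samples):
        delta = rng.uniform(-radius, radius, size=x.shape)
        perturbed = gradient_attribution(x + delta, W1, w2)
        worst = min(worst, cosine_similarity(base, perturbed))
    return worst

# Usage: a random 10-dimensional input and a 16-unit hidden layer.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 10))
w2 = rng.normal(size=16)
x = rng.normal(size=10)
score = attribution_stability(x, W1, w2, radius=0.05)
```

A score close to 1 means the attribution direction barely moves under the sampled perturbations; lower scores indicate less stable explanations. Comparing this score across models trained with different optimizer settings would be one rough way to probe the implicit effect described above.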
In practice
- Utilize SGD learning dynamics for implicit attribution robustness.
- Consider kernel-based attention for robust Transformers.
- Avoid relying on softmax-normalized attention when robust attention-based attributions are required.
Topics
- Adversarial Robustness
- Attribution Methods
- Explainable AI
- Stochastic Gradient Descent
- Transformer Models
- Kernel-based Attention
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.