Walls, Shields, and Illusions: Defenses and Their Limits
Summary
This analysis examines three common defenses against adversarial attacks on a convolutional neural network (CNN) trained on MNIST: FGSM adversarial training, PGD adversarial training, and defensive distillation. The baseline model reached 99.00% clean accuracy, falling to 96.60% under FGSM (ε=0.3) and 95.62% under PGD (ε=0.3). The FGSM-trained model reached 99.08% clean accuracy, 97.97% under FGSM, and 97.61% under PGD. The PGD-trained model also reached 99.08% clean accuracy, with 97.52% under FGSM and 97.27% under PGD. Defensive distillation was the weakest defense, with PGD accuracy ranging from 93.68% to 95.43%. The study concludes that these defenses improve robustness but do not remove the model's "silence", meaning its overconfidence in misclassified adversarial examples: a PGD-trained model misclassified one image with 0.6187 confidence. The result illustrates the continuing "work → break → adapt" cycle in adversarial machine learning.
Key takeaway
Machine learning engineers building robust models should recognize that current adversarial defenses such as PGD training raise the bar but do not eliminate overconfidence in misclassifications. Add awareness mechanisms that go beyond a simple confidence threshold: a "hesitant" wrong prediction (e.g., 0.6187 confidence) can still be confident enough to pass unflagged. Aim for models that are "undeluded" rather than merely "unbreakable."
Key insights
Adversarial defenses raise robustness but models remain overconfident and "silent" about their limits, perpetuating an arms race.
Principles
- Adversarial defenses follow a "work → break → adapt" cycle.
- Robustness training may not always incur a clean accuracy trade-off.
- Model "silence" (overconfidence in errors) persists despite defenses.
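The second principle can be checked with simple arithmetic on the figures reported in the summary. The sketch below is illustrative only: the dictionary layout and the helper name `gain_over_baseline` are hypothetical, and it assumes the defended models' attack accuracies were measured at the same ε=0.3 as the baseline's, which the summary does not state explicitly.

```python
# Accuracies (%) copied from the summary; attack columns assume eps = 0.3
# for every model (an assumption, only stated for the baseline).
reported = {
    "baseline":     {"clean": 99.00, "fgsm": 96.60, "pgd": 95.62},
    "fgsm_trained": {"clean": 99.08, "fgsm": 97.97, "pgd": 97.61},
    "pgd_trained":  {"clean": 99.08, "fgsm": 97.52, "pgd": 97.27},
}


def gain_over_baseline(model, condition):
    """Percentage-point change relative to the undefended baseline."""
    delta = reported[model][condition] - reported["baseline"][condition]
    return round(delta, 2)


for model in ("fgsm_trained", "pgd_trained"):
    gains = {c: gain_over_baseline(model, c) for c in ("clean", "fgsm", "pgd")}
    print(model, gains)
```

Under that assumption, both defended models gain a little clean accuracy over the baseline rather than losing any, while gaining roughly one to two percentage points under attack.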
Method
The study implemented FGSM adversarial training, PGD adversarial training, and defensive distillation on a CNN trained on MNIST. Each model was evaluated against FGSM and PGD attacks for ε from 0.00 to 0.40.
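To make the two attacks concrete, here is a minimal sketch of FGSM and PGD under an L∞ budget. It is not the article's implementation: a toy three-feature logistic model stands in for the CNN so the input gradient can be written by hand, and every name and value (`w`, `b`, `x`, `alpha`, `steps`) is a hypothetical choice for illustration. This PGD variant also omits the random start that many implementations use.

```python
import math


def predict(w, b, x):
    """Probability of class 1 under a toy logistic model (stand-in for the CNN)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def loss_grad_x(w, b, x, y):
    """Gradient of binary cross-entropy with respect to the input x."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def clip(v, lo, hi):
    return max(lo, min(hi, v))


def fgsm(w, b, x, y, eps):
    """Single step of size eps along the sign of the input gradient."""
    g = loss_grad_x(w, b, x, y)
    return [clip(xi + eps * sign(gi), 0.0, 1.0) for xi, gi in zip(x, g)]


def pgd(w, b, x, y, eps, alpha, steps):
    """Repeated small sign-gradient steps, each projected back into the eps-ball."""
    x_adv = list(x)
    for _ in range(steps):
        g = loss_grad_x(w, b, x_adv, y)
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        # Project onto the L-infinity ball of radius eps around the original x.
        x_adv = [clip(xa, x0 - eps, x0 + eps) for xa, x0 in zip(x_adv, x)]
        # Keep values in the valid pixel range [0, 1].
        x_adv = [clip(xa, 0.0, 1.0) for xa in x_adv]
    return x_adv


# Hypothetical toy model and input; the true label is 1.
w, b = [2.0, -3.0, 1.5], -0.2
x, y, eps = [0.6, 0.2, 0.5], 1, 0.3

x_fgsm = fgsm(w, b, x, y, eps)
x_pgd = pgd(w, b, x, y, eps, alpha=0.1, steps=10)

print("clean p(class 1):", predict(w, b, x))
print("FGSM  p(class 1):", predict(w, b, x_fgsm))
print("PGD   p(class 1):", predict(w, b, x_pgd))
```

For a linear model the two attacks land on essentially the same point, because the gradient sign never changes. On a CNN the gradient varies across the ε-ball, which is why the iterated, projected steps of PGD are generally the stronger attack.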
In practice
- Augment training data with FGSM-generated adversarial examples.
- Use PGD adversarial training for stronger, standard robustness.
- Avoid relying solely on confidence scores for adversarial detection.
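The last point can be illustrated with a small sketch of a confidence-threshold filter. Only the 0.6187 top-class confidence comes from the summary. The rest of the probability vector, the second example, the threshold values, and the helper name `is_flagged` are hypothetical.

```python
def is_flagged(probs, threshold):
    """Flag a prediction for review when its top-class probability is below threshold."""
    return max(probs) < threshold


# Hypothetical softmax output for a misclassified adversarial image.
# Only the top value (0.6187) is taken from the summary.
wrong_adversarial = [0.6187, 0.2500, 0.0500, 0.0300, 0.0200,
                     0.0113, 0.0100, 0.0050, 0.0030, 0.0020]

# Hypothetical softmax output for a correctly classified but ambiguous clean digit.
correct_but_unsure = [0.6000, 0.3000, 0.0400, 0.0200, 0.0100,
                      0.0100, 0.0100, 0.0050, 0.0030, 0.0020]

for threshold in (0.5, 0.7):
    print(threshold,
          "adversarial flagged:", is_flagged(wrong_adversarial, threshold),
          "| clean flagged:", is_flagged(correct_but_unsure, threshold))
```

At a threshold of 0.5 the adversarial error passes unflagged. Raising the threshold to 0.7 catches it, but it also rejects the correct, merely uncertain prediction, which is why a single confidence cutoff is a weak detector by itself.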
Topics
- Adversarial Robustness
- Adversarial Training
- Defensive Distillation
- Model Confidence
- Machine Learning Security
- MNIST Classification
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.