Walls, Shields, and Illusions: Defenses and Their Limits

2026-06-21 · Source: Towards AI - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Advanced, long

Summary

This analysis examines three common defenses against adversarial attacks on a convolutional neural network (CNN) trained on MNIST: FGSM adversarial training, PGD adversarial training, and defensive distillation. The baseline model achieved 99.00% clean accuracy, dropping to 96.60% under FGSM (ε=0.3) and 95.62% under PGD (ε=0.3). FGSM-trained models reached 99.08% clean accuracy, 97.97% against FGSM, and 97.61% against PGD. PGD-trained models also reached 99.08% clean accuracy, with 97.52% against FGSM and 97.27% against PGD. Defensive distillation proved the weakest defense, with PGD accuracy varying between 93.68% and 95.43%. The study concludes that these defenses improve robustness but do not eliminate the model's "silence": its overconfidence in misclassified adversarial examples. In one example, a PGD-trained model misclassified an image with 0.6187 confidence. The result illustrates the continuous "work → break → adapt" cycle in adversarial machine learning.

Key takeaway

Machine learning engineers building robust models should understand that current adversarial defenses such as PGD training raise the bar but do not eliminate model overconfidence in misclassifications. Integrate awareness mechanisms beyond simple confidence thresholds: a model's "hesitation" (e.g., 0.6187 confidence) may still be too high to flag an error. Focus on developing models that are "undeluded" rather than just "unbreakable."
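To make the threshold concern concrete, here is a minimal sketch of a max-probability confidence check. The function name `flag_low_confidence`, the 0.5 and 0.7 cutoffs, and the probability vector are hypothetical illustrations; only the 0.6187 top value comes from the summary above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def flag_low_confidence(probs, threshold):
    """Return (predicted_class, confidence, flagged).

    A prediction is flagged only when its top probability falls
    below `threshold` (a hypothetical cutoff, not from the article).
    """
    confidence = max(probs)
    predicted = probs.index(confidence)
    return predicted, confidence, confidence < threshold

# Hypothetical 10-class output: the top value matches the 0.6187
# confidence quoted in the summary; the remaining mass is invented.
probs = [0.6187, 0.2500, 0.1313] + [0.0] * 7

# With a 0.5 cutoff the misclassification is not flagged.
predicted, confidence, flagged = flag_low_confidence(probs, threshold=0.5)

# A stricter cutoff would flag it, but would also flag every correct
# prediction whose confidence happens to fall below 0.7.
_, _, flagged_strict = flag_low_confidence(probs, threshold=0.7)

print(predicted, confidence, flagged, flagged_strict)
```

Raising the cutoff trades missed errors for false alarms, which is one reason a confidence threshold alone is unlikely to be a sufficient awareness mechanism.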

Key insights

Adversarial defenses raise robustness but models remain overconfident and "silent" about their limits, perpetuating an arms race.

Principles

Adversarial defenses follow a "work → break → adapt" cycle.
Robustness training does not always cost clean accuracy: here both adversarially trained variants reached 99.08% against a 99.00% baseline.
Model "silence" (overconfidence in errors) persists despite defenses.

Method

The study implemented FGSM adversarial training, PGD adversarial training, and defensive distillation on a CNN trained on MNIST. Models were evaluated against FGSM and PGD attacks for ε ranging from 0.00 to 0.40.
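The two attacks used in the evaluation can be sketched on a toy model. The example below is an illustration under stated assumptions, not the article's implementation: it uses a binary logistic-regression model instead of a CNN so the input gradient has a closed form, and the weights, input, step size `alpha`, and step count are all hypothetical. The random start often used with PGD is omitted to keep the sketch deterministic.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def input_gradient(w, b, x, y):
    """Gradient of binary cross-entropy with respect to the input x.

    For p = sigmoid(w.x + b) this is (p - y) * w; a CNN would obtain
    the same quantity through automatic differentiation.
    """
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def fgsm(w, b, x, y, eps):
    """Single step of size eps along the sign of the input gradient."""
    g = input_gradient(w, b, x, y)
    return [clip(xi + eps * sign(gi), 0.0, 1.0) for xi, gi in zip(x, g)]

def pgd(w, b, x, y, eps, alpha=0.01, steps=40):
    """Iterated small steps, projected back into the L-infinity ball
    of radius eps around x and into the valid pixel range [0, 1]."""
    x_adv = list(x)
    for _ in range(steps):
        g = input_gradient(w, b, x_adv, y)
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        x_adv = [clip(xa, xi - eps, xi + eps) for xa, xi in zip(x_adv, x)]
        x_adv = [clip(xa, 0.0, 1.0) for xa in x_adv]
    return x_adv

# Hypothetical model and input with true label y = 1.
w, b = [2.0, -3.0, 1.0], 0.0
x, y, eps = [0.5, 0.2, 0.7], 1, 0.3

x_fgsm = fgsm(w, b, x, y, eps)
x_pgd = pgd(w, b, x, y, eps)
print(predict(w, b, x), predict(w, b, x_fgsm), predict(w, b, x_pgd))
```

On a linear model the gradient sign never changes, so PGD ends at the same point as FGSM. On a nonlinear network the iterated steps can find stronger perturbations, which is why PGD is usually treated as the harder test.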

In practice

Augment training data with FGSM-generated adversarial examples.
Use PGD adversarial training for stronger, standard robustness.
Avoid relying solely on confidence scores for adversarial detection.
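The first recommendation, augmenting training with FGSM examples, can be sketched as follows. This is a toy illustration under assumptions rather than the article's code: a logistic-regression model trained with plain SGD stands in for the CNN, and the dataset, learning rate, epoch count, and ε are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Branch on the sign of z to avoid overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sign(v):
    return (v > 0) - (v < 0)

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(w, b, x, y, eps):
    """FGSM for logistic regression: the input gradient is (p - y) * w."""
    p = predict(w, b, x)
    return [min(1.0, max(0.0, xi + eps * sign((p - y) * wi)))
            for xi, wi in zip(x, w)]

def sgd_step(w, b, x, y, lr):
    """One SGD step on binary cross-entropy for a single example."""
    err = predict(w, b, x) - y
    return [wi - lr * err * xi for wi, xi in zip(w, x)], b - lr * err

def adversarial_train(data, eps, lr=0.2, epochs=300, seed=0):
    """Train on each clean example and on its FGSM counterpart,
    crafted against the current weights at every step."""
    rng = random.Random(seed)
    samples = list(data)
    w, b = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            x_adv = fgsm(w, b, x, y, eps)
            w, b = sgd_step(w, b, x, y, lr)
            w, b = sgd_step(w, b, x_adv, y, lr)
    return w, b

def accuracy(w, b, data):
    hits = sum((predict(w, b, x) > 0.5) == (y == 1) for x, y in data)
    return hits / len(data)

def robust_accuracy(w, b, data, eps):
    attacked = [(fgsm(w, b, x, y, eps), y) for x, y in data]
    return accuracy(w, b, attacked)

# Hypothetical two-cluster dataset with features in [0, 1].
data = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.2, 0.3], 0), ([0.3, 0.2], 0),
        ([0.8, 0.9], 1), ([0.9, 0.8], 1), ([0.7, 0.8], 1), ([0.8, 0.7], 1)]

w, b = adversarial_train(data, eps=0.1)
print(accuracy(w, b, data), robust_accuracy(w, b, data, eps=0.1))
```

For the second recommendation, the same loop would apply with the single-step `fgsm` call replaced by a multi-step projected attack, at the cost of several gradient computations per training example.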

Topics

Adversarial Robustness
Adversarial Training
Defensive Distillation
Model Confidence
Machine Learning Security
MNIST Classification

Code references

Maee127/Adversarial-Notebooks

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.