(0) Abstract

Adversarial inputs = almost indistinguishable from natural data yet classified incorrectly by the network.


Research at the time suggested that adversarial attacks may be an inherent weakness of deep networks.

Problem addressed through robust optimisation.

To get to fully resistant deep learning models, one of the main steps is achieving robustness against well-defined classes of (strong) adversaries.

(1) Introduction

So the question is:

How can we train deep neural networks that are robust to adversarial inputs?

Benign input = a valid, unperturbed input that the model is supposed to classify correctly.

So we can frame the problem using a natural saddle point (min-max) formulation, which has 2 benefits:

  1. It gives a unified view of attacks and defenses: attacks correspond to the inner maximisation, training a robust model to the outer minimisation.
  2. It gives a concrete, quantitative goal: the value of the saddle point problem measures how robust the model is.

<aside> ❗ So we find that adversarial training corresponds to optimising the saddle point problem (inner max = the attack, outer min = the training).

</aside>
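Written out for reference (following the paper's notation): $\theta$ are the model parameters, $\mathcal{D}$ the data distribution, $\mathcal{S}$ the set of allowed perturbations (e.g. an $\ell_\infty$-ball of radius $\epsilon$), and $L$ the loss:

$$\min_\theta \rho(\theta), \qquad \rho(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\max_{\delta \in \mathcal{S}} L(\theta, x+\delta, y)\Big]$$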

(1) What does the paper cover?

  1. Explores the optimisation of the saddle point problem, which can be (approximately) solved using first order methods

    <aside> 💡 First order method = any optimisation method that uses only first-derivative (gradient) information, e.g. gradient descent, SGD, etc.

    </aside>

    Using these insights, it motivates PGD (Projected Gradient Descent) as a universal first order adversary = the strongest attack that uses only first order (gradient) information about the network (see the sketch after this list).

  2. Model capacity plays an important role: networks need noticeably larger capacity than for benign classification in order to withstand strong adversarial attacks.

  3. Trained models on the MNIST and CIFAR10 datasets using PGD as a reliable first order adversary, yielding excellent results (accuracy against the adversary; a training-loop sketch follows this list):

    1. On white box attacks (where the attacker has full knowledge of the model, including its weights) with an iterative adversary (an attack that takes multiple gradient steps, i.e. PGD)
      1. MNIST = 89%
      2. CIFAR10 = 46%
    2. On black box attacks (where the attacker has no access to the model's weights and transfers adversarial examples crafted against a different model)
      1. MNIST = 95%
      2. CIFAR10 = 64%
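Below is a minimal sketch (not the paper's code) of the $\ell_\infty$ PGD attack referenced above, written in PyTorch; `epsilon`, `alpha`, and `num_steps` are illustrative values rather than the paper's exact hyperparameters.

```python
import torch

def pgd_attack(model, x, y, loss_fn, epsilon=0.3, alpha=0.01, num_steps=40):
    """l-infinity PGD sketch: repeatedly step in the sign of the input gradient,
    then project back into the epsilon-ball around the original input."""
    # Random start inside the epsilon-ball (PGD is usually run with random restarts).
    delta = torch.empty_like(x).uniform_(-epsilon, epsilon)
    x_adv = (x + delta).clamp(0.0, 1.0).detach()

    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            # Gradient-ascent step on the loss, then projection onto the
            # epsilon-ball and the valid pixel range.
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)
            x_adv = x_adv.clamp(0.0, 1.0)
        x_adv = x_adv.detach()
    return x_adv
```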
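And a sketch of how PGD adversarial training plugs that attack into the saddle point problem: the inner maximisation is approximated by `pgd_attack`, the outer minimisation by a standard optimiser step on the adversarial examples. `model`, `train_loader`, and `optimizer` are assumed placeholders, not the paper's training setup.

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, train_loader, optimizer, device="cpu"):
    """One epoch of PGD adversarial training: approximate the inner max with
    pgd_attack, then take an optimiser step on the loss at those points."""
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        # Inner maximisation: find (approximately) worst-case perturbed inputs.
        x_adv = pgd_attack(model, x, y, F.cross_entropy)
        # Outer minimisation: update parameters on the adversarial examples.
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```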