mixup: Beyond Empirical Risk Minimization

( https://arxiv.org/pdf/1710.09412.pdf )


Contents

  1. Abstract
  2. Introduction
    1. Properties of ERM
    2. VRM / DA
    3. Mixup
  3. from ERM to Mixup
    1. VRM
    2. Contribution


0. Abstract

Problems of DL

  • memorization
  • sensitivity to adversarial examples

$\rightarrow$ propose mixup


Mixup

  • trains NN on convex combinations of (x,y) pairs
  • regularizes NN to favor simple linear behavior in-between training examples


1. Introduction

NN share two commonalities.

  1. trained to minimize their average error over the training data

    ( = Empirical Risk Minimization (ERM) principle )

  2. size of SOTA NN scales linearly with the number of training examples ( = N )


( classical result in learning theory ) convergence of ERM is guaranteed as long as the size of the learning machine does not increase with N

  • size of a learning machine : measured in terms of its number of parameters ( or VC-complexity )


Properties of ERM

(1) ERM allows large NN to memorize the training data

  • even in the presence of strong regularization
  • even in classification problems where the labels are assigned at random

(2) NN trained with ERM change their predictions drastically when evaluated on examples just outside the training distribution ( = adversarial examples )

ERM is unable to explain or provide generalization on testing distributions that differ only slightly from the training data

Then… what is an alternative to ERM?


VRM / DA

  • Data Augmentation (DA) : formalized by the Vicinal Risk Minimization (VRM) principle

  • in VRM, human knowledge is required to describe a vicinity or neighborhood around each example in the training data

    draw additional virtual examples from the vicinity distribution of the training examples

    enlarge the support of the training distribution.

  • example ) CV

    • common to define the vicinity of one image as the set of its horizontal reflections, slight rotations, and mild scalings
  • while DA leads to improved generalization …..

    • dataset-dependent & requires expert knowledge.

    • assumes that examples in the vicinity share the same class,

      & does not model the vicinity relation across examples of different classes


Mixup

introduce a simple and data-agnostic DA routine

constructs virtual training examples as …

$\tilde{x}=\lambda x_i+(1-\lambda)x_j$, where $x_i, x_j$ are raw input vectors

$\tilde{y}=\lambda y_i+(1-\lambda)y_j$, where $y_i, y_j$ are one-hot label encodings


Summary

  • extends the training distribution by incorporating the prior knowledge that…

    linear interpolations of feature vectors should lead to linear interpolations of the associated targets
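As a tiny numerical sketch of the construction above (the numbers and variable names are made up for illustration, not taken from the paper), for a fixed $\lambda$ the virtual example is just a convex combination of two training pairs:

```python
import numpy as np

# two training examples (x_i, y_i) and (x_j, y_j); labels are one-hot
x_i, y_i = np.array([1.0, 2.0]), np.array([1.0, 0.0])
x_j, y_j = np.array([3.0, 0.0]), np.array([0.0, 1.0])

lam = 0.3                                  # lambda in [0, 1]
x_tilde = lam * x_i + (1 - lam) * x_j      # virtual input : [2.4, 0.6]
y_tilde = lam * y_i + (1 - lam) * y_j      # virtual label : [0.3, 0.7] (soft label)
```

The soft label $[0.3, 0.7]$ is what encodes the prior that interpolated inputs should get interpolated targets.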


2. from ERM to Mixup

Supervised learning :

  • find a function $f \in \mathcal{F}$
    • that describes the relationship between a random feature vector $X$ and a random target vector $Y$, which follow the joint distribution $P(X,Y)$.
  • define a loss function $\ell$

    • that penalizes the differences between $f(x)$ and $y$, for examples $(x,y) \sim P$.
  • minimize the average of the loss function $\ell$ over the data distribution $P$,

    ( = Expected Risk , $R(f)=\int \ell(f(x),y)\, dP(x,y)$ )


Distribution $P$ is unknown!

  • instead, have access to a set of training data $\mathcal{D}=\{(x_i, y_i)\}_{i=1}^{n}$, where $(x_i, y_i) \sim P$
  • approximate $P$ by the empirical distribution
    • $P_{\delta}(x,y)=\frac{1}{n}\sum_{i=1}^{n}\delta(x=x_i, y=y_i)$
      • where $\delta(x=x_i, y=y_i)$ is a Dirac mass centered at $(x_i, y_i)$.
    • using $P_{\delta}$, approximate the expected risk by the empirical risk
      • $R_{\delta}(f)=\int \ell(f(x),y)\, dP_{\delta}(x,y)=\frac{1}{n}\sum_{i=1}^{n}\ell(f(x_i), y_i)$

$\rightarrow$ learning $f$ by minimizing $R_{\delta}(f)$ = the Empirical Risk Minimization (ERM) principle
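As a toy illustration (the identity predictor and squared loss below are arbitrary choices, not from the paper), the empirical risk is simply the average loss over the $n$ training pairs:

```python
import numpy as np

def empirical_risk(f, loss, xs, ys):
    """R_delta(f) = (1/n) * sum_i loss(f(x_i), y_i) : average loss over the training set."""
    return np.mean([loss(f(x), y) for x, y in zip(xs, ys)])

# toy usage: identity predictor with a squared loss
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([0.1, 0.9, 2.2])
risk = empirical_risk(lambda x: x, lambda y_hat, y: (y_hat - y) ** 2, xs, ys)
```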


Pros & Cons

  • [pros] efficient to compute

  • [cons] monitors the behaviour of f only at a finite set of n examples

one trivial way to minimize it : memorize the training data ( = overfitting )


( Figure 2 )


The naïve estimate $P_{\delta}$ is one out of many possible choices to approximate the true distribution $P$.

  • ex) Vicinal Risk Minimization (VRM)


VRM

  • distn $P$ is approximated by $P_{\nu}(\tilde{x},\tilde{y})=\frac{1}{n}\sum_{i=1}^{n}\nu(\tilde{x},\tilde{y} \mid x_i,y_i)$.
  • $\nu$ : a vicinity distribution
    • measures the probability of finding the virtual feature-target pair $(\tilde{x},\tilde{y})$ in the vicinity of the training feature-target pair $(x_i,y_i)$.
  • ex 1) Gaussian vicinities
    • $\nu(\tilde{x},\tilde{y} \mid x_i,y_i)=\mathcal{N}(\tilde{x}-x_i,\sigma^{2})\,\delta(\tilde{y}=y_i)$,
    • which is equivalent to augmenting the training data with additive Gaussian noise.


To learn using VRM …

  • (1) sample the vicinal distribution to construct a dataset $\mathcal{D}_{\nu}:=\{(\tilde{x}_i,\tilde{y}_i)\}_{i=1}^{m}$,
  • (2) minimize the empirical vicinal risk: $R_{\nu}(f)=\frac{1}{m}\sum_{i=1}^{m}\ell(f(\tilde{x}_i),\tilde{y}_i)$.
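A minimal numpy sketch of these two steps for the Gaussian vicinity above (function names and the value of $\sigma$ are illustrative assumptions):

```python
import numpy as np

def sample_gaussian_vicinity(xs, ys, m, sigma=0.1, rng=None):
    """Step (1): draw m virtual examples from the Gaussian vicinity,
    i.e. x_tilde = x_i + Gaussian noise, y_tilde = y_i (label unchanged)."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, len(xs), size=m)                   # pick a training example per draw
    x_tilde = xs[idx] + sigma * rng.standard_normal(xs[idx].shape)
    y_tilde = ys[idx]
    return x_tilde, y_tilde

def empirical_vicinal_risk(f, loss, x_tilde, y_tilde):
    """Step (2): R_nu(f) = average loss over the m virtual examples."""
    return np.mean([loss(f(x), y) for x, y in zip(x_tilde, y_tilde)])
```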


Contribution of this paper

propose a generic vicinal distribution

$\mu(\tilde{x},\tilde{y} \mid x_i,y_i)=\frac{1}{n}\sum_{j}^{n}\mathbb{E}_{\lambda}\left[\delta(\tilde{x}=\lambda \cdot x_i+(1-\lambda)\cdot x_j,\ \tilde{y}=\lambda \cdot y_i+(1-\lambda)\cdot y_j)\right]$.

  • where $\lambda \sim \text{Beta}(\alpha,\alpha)$, for $\alpha \in (0,\infty)$.

    • $\alpha$ controls the strength of interpolation between feature-target pairs

      ( recovers the ERM principle as $\alpha \rightarrow 0$ )

  • $\tilde{x}=\lambda x_i+(1-\lambda)x_j$.

  • $\tilde{y}=\lambda y_i+(1-\lambda)y_j$.

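Putting this together, one common way to sample from this vicinal distribution in practice is to mix a minibatch with a randomly permuted copy of itself, drawing a single $\lambda \sim \text{Beta}(\alpha,\alpha)$ per batch. The sketch below is an assumed typical implementation (names are illustrative), not the authors' exact code:

```python
import numpy as np

def mixup_minibatch(x, y_onehot, alpha=0.2, rng=None):
    """Sample lambda ~ Beta(alpha, alpha) and mix the batch with a shuffled copy of itself.
    As alpha -> 0, Beta(alpha, alpha) concentrates on {0, 1}, recovering plain ERM."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)              # interpolation strength for this batch
    perm = rng.permutation(len(x))            # random partner (x_j, y_j) for each (x_i, y_i)
    x_tilde = lam * x + (1 - lam) * x[perm]
    y_tilde = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_tilde, y_tilde
```

The mixed pair $(\tilde{x}, \tilde{y})$ is then fed to the network and trained with the usual loss.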
