mixup: Beyond Empirical Risk Minimization
( https://arxiv.org/pdf/1710.09412.pdf )
Contents
- Abstract
- Introduction
- Properties of ERM
- VRM / DA
- Mixup
- from ERM to Mixup
- VRM
- Contribution
0. Abstract
Problems of DL
- memorization
- sensitivity to adversarial examples
$\rightarrow$ propose mixup
Mixup
- trains NN on convex combinations of (x,y) pairs
- regularizes NN to favor simple linear behavior in-between training examples
1. Introduction
NNs share two commonalities.
- trained to minimize their average error over the training data
  ( = Empirical Risk Minimization (ERM) principle )
- size of SOTA NNs scales linearly with the number of training examples ( = N )

( classical result in learning theory ) convergence of ERM is guaranteed as long as the size of the learning machine does not increase with N
- size of a learning machine : measured in terms of its number of parameters ( or VC-complexity )
Properties of ERM
(1) ERM allows large NN to memorize the training data
- even in the presence of strong regularization
- even in classification problems where the labels are assigned at random
(2) NNs trained with ERM change their predictions drastically when evaluated on examples just outside the training distribution ( = adversarial examples )
→ ERM is unable to explain or provide generalization on testing distributions that differ only slightly from the training data
Then… what is an alternative to ERM?
VRM / DA
- Data Augmentation (DA) : formalized by the Vicinal Risk Minimization (VRM) principle
- in VRM, human knowledge is required to describe a vicinity or neighborhood around each example in the training data
  → draw additional virtual examples from the vicinity distribution of the training examples
  → enlarge the support of the training distribution
- example ) CV
  - common to define the vicinity of one image as the set of its horizontal reflections, slight rotations, and mild scalings
- while DA leads to improved generalization …
  - it is dataset-dependent & requires expert knowledge
  - it assumes that examples in the vicinity share the same class,
    & does not model the vicinity relation across examples of different classes
Mixup
introduce a simple and data-agnostic DA
constructs virtual training examples as …
- $\tilde{x} = \lambda x_i + (1-\lambda) x_j$, where $x_i, x_j$ are raw input vectors
- $\tilde{y} = \lambda y_i + (1-\lambda) y_j$, where $y_i, y_j$ are one-hot label encodings
Summary
- extends the training distribution by incorporating the prior knowledge that…
  linear interpolations of feature vectors should lead to linear interpolations of the associated targets
  ( see the sketch below )
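A minimal NumPy sketch of this construction. The helper name `mixup_pair` is hypothetical (not the authors' released code), and λ is drawn from the Beta(α, α) distribution that the paper introduces later:

```python
import numpy as np

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.2):
    """Build one virtual example (x_tilde, y_tilde) from two training pairs."""
    lam = np.random.beta(alpha, alpha)        # interpolation weight in [0, 1]
    x_tilde = lam * x_i + (1.0 - lam) * x_j   # convex combination of raw inputs
    y_tilde = lam * y_i + (1.0 - lam) * y_j   # convex combination of one-hot labels
    return x_tilde, y_tilde

# usage: mix a class-0 example with a class-2 example (3 classes)
x_i, y_i = np.array([0.1, 0.9]), np.array([1.0, 0.0, 0.0])
x_j, y_j = np.array([0.7, 0.3]), np.array([0.0, 0.0, 1.0])
x_t, y_t = mixup_pair(x_i, y_i, x_j, y_j)
```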
2. From ERM to Mixup
Supervised learning :
- find a function $f \in \mathcal{F}$ that describes the relationship between a random input $X$ and a target $Y$, which follow the joint distribution $P(X, Y)$
- define a loss function $\ell$ that penalizes the differences between $f(x)$ and $y$, for $(x, y) \sim P$
- minimize the average of the loss function $\ell$ over the data distribution $P$
  ( = Expected Risk, $R(f) = \int \ell(f(x), y) \, dP(x, y)$ )
Distribution $P$ is unknown!
- instead, have access to a set of training data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, where $(x_i, y_i) \sim P$
- approximate $P$ by the empirical distribution
  - $P_\delta(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta(x = x_i, y = y_i)$
  - where $\delta(x = x_i, y = y_i)$ is a Dirac mass centered at $(x_i, y_i)$
- using $P_\delta$, approximate the expected risk by the empirical risk
  - $R_\delta(f) = \int \ell(f(x), y) \, dP_\delta(x, y) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$

→ Empirical Risk Minimization (ERM) principle ( toy computation below )
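A tiny sketch of the empirical risk as an average over the training set; the predictor `f`, the squared loss, and the toy data here are illustrative stand-ins, not from the paper:

```python
import numpy as np

def empirical_risk(f, loss, xs, ys):
    # R_delta(f) = (1/n) * sum_i loss(f(x_i), y_i)
    return np.mean([loss(f(x), y) for x, y in zip(xs, ys)])

# toy example: a linear predictor with squared loss on 3 training points
f = lambda x: 2.0 * x
loss = lambda pred, y: (pred - y) ** 2
xs, ys = np.array([0.0, 1.0, 2.0]), np.array([0.1, 2.1, 3.9])
print(empirical_risk(f, loss, xs, ys))   # average loss over the empirical distribution
```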
Pros & Cons
- [pros] efficient to compute
- [cons] monitors the behaviour of $f$ only at a finite set of $n$ examples
  → a trivial way to minimize it : memorize the training data ( = overfitting )
The naïve estimate $P_\delta$ is one out of many possible choices to approximate the true distribution $P$.
- ex) Vicinal Risk Minimization (VRM)
VRM
- distribution $P$ is approximated by $P_\nu(\tilde{x}, \tilde{y}) = \frac{1}{n} \sum_{i=1}^{n} \nu(\tilde{x}, \tilde{y} \mid x_i, y_i)$
  - $\nu$ : a vicinity distribution
  - measures the probability of finding the virtual feature-target pair $(\tilde{x}, \tilde{y})$ in the vicinity of the training feature-target pair $(x_i, y_i)$
- ex ) Gaussian vicinities
  - $\nu(\tilde{x}, \tilde{y} \mid x_i, y_i) = \mathcal{N}(\tilde{x} - x_i, \sigma^2) \, \delta(\tilde{y} = y_i)$
  - which is equivalent to augmenting the training data with additive Gaussian noise
To learn using VRM …
- (1) sample the vicinal distribution to construct a dataset $\mathcal{D}_\nu := \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{m}$
- (2) minimize the empirical vicinal risk : $R_\nu(f) = \frac{1}{m} \sum_{i=1}^{m} \ell(f(\tilde{x}_i), \tilde{y}_i)$
  ( see the sketch below )
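A sketch of step (1) for the Gaussian-vicinity example above. The function name `sample_gaussian_vicinity`, the value of σ, and the toy data are illustrative assumptions:

```python
import numpy as np

def sample_gaussian_vicinity(xs, ys, m, sigma=0.1, seed=0):
    """Draw m virtual examples from the Gaussian vicinal distribution:
    x_tilde = x_i + Gaussian noise, y_tilde = y_i (label unchanged)."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(xs), size=m)          # which training pair each draw comes from
    x_tilde = xs[idx] + rng.normal(0.0, sigma, size=xs[idx].shape)
    y_tilde = ys[idx]                               # the vicinity keeps the original label
    return x_tilde, y_tilde

# usage: build D_nu with m = 1000 virtual examples, then minimize R_nu on it
xs = np.random.randn(100, 2)             # toy features
ys = np.random.randint(0, 3, size=100)   # toy labels
x_v, y_v = sample_gaussian_vicinity(xs, ys, m=1000)
```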
Contribution of this paper
propose a generic vicinal distribution :
$\mu(\tilde{x}, \tilde{y} \mid x_i, y_i) = \frac{1}{n} \sum_{j}^{n} \mathbb{E}_\lambda \big[ \delta(\tilde{x} = \lambda \cdot x_i + (1-\lambda) \cdot x_j, \; \tilde{y} = \lambda \cdot y_i + (1-\lambda) \cdot y_j) \big]$
- where $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$, for $\alpha \in (0, \infty)$
- $\alpha$ controls the strength of interpolation between feature-target pairs
  ( recovers the ERM principle as $\alpha \to 0$ )
- sampling from $\mu$ yields the mixup virtual examples :
  - $\tilde{x} = \lambda x_i + (1-\lambda) x_j$
  - $\tilde{y} = \lambda y_i + (1-\lambda) y_j$
  ( training-step sketch below )
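A hedged PyTorch sketch of one training step under this vicinal distribution. Pairing examples via a random permutation within the minibatch and splitting the cross-entropy loss into a λ-weighted sum are common implementation choices, not dictated by the formula; `mixup_train_step` is a hypothetical name, not the authors' released code:

```python
import numpy as np
import torch
import torch.nn as nn

def mixup_train_step(model, optimizer, x, y, alpha=1.0):
    """One optimization step on a mixed minibatch; y holds integer class labels.

    With cross-entropy, the loss on the mixed one-hot target decomposes as
    lam * CE(pred, y_i) + (1 - lam) * CE(pred, y_j), which is used below.
    As alpha -> 0, lam concentrates near 0 or 1 and the step approaches plain ERM.
    """
    criterion = nn.CrossEntropyLoss()
    lam = float(np.random.beta(alpha, alpha))   # lambda ~ Beta(alpha, alpha)
    index = torch.randperm(x.size(0))           # random partner j for each example i
    x_mix = lam * x + (1.0 - lam) * x[index]    # tilde-x
    pred = model(x_mix)
    loss = lam * criterion(pred, y) + (1.0 - lam) * criterion(pred, y[index])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```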