Self-supervised Label Augmentation via Input Transformations
Contents
- Abstract
- Introduction
- Self-supervised Label Augmentation (SLA)
- Multi-task Learning with Self-supervision
- Eliminating Invariance via Joint-label Classifier
0. Abstract
Self-supervised learning
- constructs artificial labels

This paper:
- shows that constructing artificial labels works well even on fully labeled datasets
- main idea: learn a single unified task w.r.t. the joint distribution of the (1) original labels & (2) self-supervised labels
  (augment these 2 labels)
- proposes a novel knowledge transfer technique (= self-distillation)
  → faster inference!
1. Introduction
Contribution
- multi-task approach:
  - enforcing invariance to transformations
    → may lead to bad results in some cases
- propose a simple & effective algorithm
  - learn a single unified task
  - use the joint distribution of the original & self-supervised labels
Self-supervised Label Augmentation (SLA)
- proposed label augmentation method
  - does not force any invariance to transformations
  - assigns a different label for each transformation
    → makes it possible to obtain a prediction by aggregation
    (acts as an ensemble)
- propose a self-distillation technique
  (transfers the knowledge of the multiple inferences into a single inference)
2. Self-supervised Label Augmentation (SLA)
(setting: fully-supervised scenario)

- problems of the conventional multi-task learning approach
- introduce the proposed algorithm
  (+ 2 additional techniques: (1) aggregation & (2) self-distillation)
  - (1) aggregation:
    - uses all differently augmented samples
    - provides an ensemble effect using a single model
  - (2) self-distillation:
    - transfers the aggregated knowledge into the model itself for acceleration
Notation
- input: $x \in \mathbb{R}^d$
- number of classes: $N$
- cross-entropy loss: $\mathcal{L}_{\text{CE}}$
- softmax classifier: $\sigma(\cdot\,; u)$
  - $\sigma_i(z; u) = \exp(u_i^\top z) / \sum_k \exp(u_k^\top z)$
- embedding vector: $z = f(x; \theta)$
- augmented sample: $\tilde{x} = t(x)$
- embedding vector of the augmented sample: $\tilde{z} = f(\tilde{x}; \theta)$
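A minimal sketch of this notation in PyTorch; the toy backbone `f`, the embedding size, and the input shape are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, N = 3 * 32 * 32, 10                               # assumed input dim and class count
f = nn.Sequential(nn.Flatten(), nn.Linear(d, 128))   # toy backbone f(.; theta) -> embedding z
u = nn.Linear(128, N, bias=False)                    # primary classifier weights u

x = torch.randn(8, 3, 32, 32)    # a batch of inputs x
z = f(x)                         # embedding z = f(x; theta)
p = F.softmax(u(z), dim=1)       # sigma_i(z; u) = exp(u_i^T z) / sum_k exp(u_k^T z)
```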
(1) Multi-task Learning with Self-supervision
Transformation-based self-supervised learning
- learns to predict which transformation was applied
- usually uses 2 losses (multi-task learning)

Loss function:
$$\mathcal{L}_{\text{MT}}(x, y; \theta, u, v) = \frac{1}{M} \sum_{j=1}^{M} \Big[ \mathcal{L}_{\text{CE}}(\sigma(\tilde{z}_j; u), y) + \mathcal{L}_{\text{CE}}(\sigma(\tilde{z}_j; v), j) \Big]$$

- $\{t_j\}_{j=1}^{M}$: pre-defined transformations
- $\tilde{x}_j = t_j(x)$: sample transformed by $t_j$
- $\tilde{z}_j = f(\tilde{x}_j; \theta)$: embedding of the transformed sample
- $\sigma(\cdot\,; u)$: classifier (for the primary task)
- $\sigma(\cdot\,; v)$: classifier (for the self-supervised task)

→ forces the primary classifier to be invariant to the transformations
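A sketch of $\mathcal{L}_{\text{MT}}$, assuming the transformations $\{t_j\}$ are the four 90° rotations (a common choice for this kind of self-supervision); the toy backbone and dimensions are illustrative only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

M, N = 4, 10                                    # 4 rotations, N primary classes (assumed)
f = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # toy backbone
u = nn.Linear(128, N, bias=False)               # primary classifier sigma(.; u)
v = nn.Linear(128, M, bias=False)               # self-supervised classifier sigma(.; v)

def multi_task_loss(x, y):
    """L_MT: average over the M transformed copies of (primary CE with the
    same label y for every copy) + (CE for predicting which transform)."""
    loss = 0.0
    for j in range(M):
        x_j = torch.rot90(x, k=j, dims=(2, 3))      # t_j(x): rotate by j * 90 degrees
        z_j = f(x_j)                                # embedding of the transformed sample
        loss = loss + F.cross_entropy(u(z_j), y)                      # primary task
        loss = loss + F.cross_entropy(v(z_j), torch.full_like(y, j))  # which transform?
    return loss / M

x, y = torch.randn(8, 3, 32, 32), torch.randint(0, N, (8,))
print(multi_task_loss(x, y))
```

Feeding the same label `y` for every rotated copy is exactly what enforces the (possibly harmful) invariance discussed next.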
LIMITATION
- depending on the type of transformation, forcing invariance might hurt performance!
- ex) rotation: the digits 6 & 9 look alike once rotated
(2) Eliminating Invariance via Joint-label Classifier
- remove the unnecessary INVARIANCE property of the classifier $\sigma(f(\cdot); u)$
- instead, use a JOINT softmax classifier $\rho(\cdot\,; w)$
  - joint probability: $P(i, j \mid \tilde{x}) = \rho_{ij}(\tilde{z}; w) = \exp(w_{ij}^\top \tilde{z}) / \sum_{k,l} \exp(w_{kl}^\top \tilde{z})$
- Loss function: $\mathcal{L}_{\text{SLA}}(x, y; \theta, w) = \frac{1}{M} \sum_{j=1}^{M} \mathcal{L}_{\text{CE}}(\rho(\tilde{z}_j; w), (y, j))$
  - where $\mathcal{L}_{\text{CE}}(\rho(\tilde{z}; w), (i, j)) = -\log \rho_{ij}(\tilde{z}; w)$
- only increases the number of labels
  (the number of additional parameters is very small)
- $\mathcal{L}_{\text{MT}}$ and $\mathcal{L}_{\text{SLA}}$ consider the same set of multi-labels
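A sketch of the joint-label classifier and $\mathcal{L}_{\text{SLA}}$; flattening the pair $(i, j)$ to the index $i \cdot M + j$ is an implementation choice, and the toy backbone, dimensions, and rotation transformations are the same illustrative assumptions as above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

M, N = 4, 10                                        # 4 rotations, N primary classes (assumed)
f = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # toy backbone
w = nn.Linear(128, N * M, bias=False)               # joint classifier rho(.; w) over N x M labels

def sla_loss(x, y):
    """L_SLA: a single cross-entropy over the joint label (y, j),
    so the classifier is NOT asked to be invariant to t_j."""
    loss = 0.0
    for j in range(M):
        x_j = torch.rot90(x, k=j, dims=(2, 3))      # t_j(x)
        logits = w(f(x_j))                          # scores w_{il}^T z~_j for all (i, l)
        joint_y = y * M + j                         # flat index for the joint label (y, j)
        loss = loss + F.cross_entropy(logits, joint_y)
    return loss / M
```

Compared with `multi_task_loss`, nothing forces the $M$ rotated copies of an image to receive the same prediction; each copy has its own joint label.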
Aggregated Inference
- at inference, we do not need to consider all $N \times M$ labels
  (because we already know which transformation was applied)
- make a prediction using the conditional probability
  - $P(i \mid \tilde{x}_j, j) = \exp(w_{ij}^\top \tilde{z}_j) / \sum_k \exp(w_{kj}^\top \tilde{z}_j)$
- for all possible transformations…
  - aggregate the conditional probabilities!
  - acts as an ensemble model
  - $P_{\text{aggregated}}(i \mid x) = \exp(s_i) / \sum_{k=1}^{N} \exp(s_k)$, where $s_i$ aggregates the class-$i$ scores over all $M$ transformations
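A sketch of aggregated inference; averaging the class-$i$ scores $w_{ij}^\top \tilde{z}_j$ over the $M$ transformations before a single softmax is one natural reading of $s_i$, and the toy backbone and joint classifier are the same illustrative assumptions as above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

M, N = 4, 10
f = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # toy backbone (assumed)
w = nn.Linear(128, N * M, bias=False)                         # joint classifier rho(.; w)

@torch.no_grad()
def aggregated_predict(x):
    """P_aggregated(i | x): average the class-i scores over all M known
    transformations, then take a single softmax (single-model ensemble)."""
    s = 0.0
    for j in range(M):
        x_j = torch.rot90(x, k=j, dims=(2, 3))     # t_j(x); j is known at test time
        logits = w(f(x_j)).view(-1, N, M)          # scores for all (i, l) pairs
        s = s + logits[:, :, j]                    # keep only column j: w_{ij}^T z~_j
    return F.softmax(s / M, dim=1)                 # exp(s_i) / sum_k exp(s_k)
```

Because the transformation index $j$ is known for each augmented copy, only column $j$ of the joint scores is used, so the $N \times M$ label space never has to be searched.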
Self-distillation from aggregation
- aggregated inference requires computing $\tilde{z}_j = f(\tilde{x}_j)$ for all $j$
  → $M$ times higher computation cost than a single inference
- Solution: perform self-distillation
  - from) the aggregated knowledge $P_{\text{aggregated}}(\cdot \mid x)$
  - to) another classifier $\sigma(f(x; \theta); u)$
    → can maintain the aggregated knowledge while using only $z = f(x)$
- objective function:

$$\mathcal{L}_{\text{SLA+SD}}(x, y; \theta, w, u) = \mathcal{L}_{\text{SLA}}(x, y; \theta, w) + D_{\text{KL}}\big(P_{\text{aggregated}}(\cdot \mid x) \,\|\, \sigma(z; u)\big) + \beta\, \mathcal{L}_{\text{CE}}(\sigma(z; u), y)$$
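A sketch of the combined objective $\mathcal{L}_{\text{SLA+SD}}$; detaching the aggregated prediction so it acts as a fixed teacher signal, and the value of $\beta$, are implementation assumptions, as are the toy modules carried over from the sketches above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

M, N, beta = 4, 10, 1.0                            # beta is a hyperparameter (assumed value)
f = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # toy backbone (assumed)
w = nn.Linear(128, N * M, bias=False)              # joint classifier rho(.; w)
u = nn.Linear(128, N, bias=False)                  # single-inference classifier sigma(.; u)

def sla_sd_loss(x, y):
    """L_SLA+SD = L_SLA + KL(P_aggregated || sigma(z; u)) + beta * CE(sigma(z; u), y)."""
    sla, s = 0.0, 0.0
    for j in range(M):
        x_j = torch.rot90(x, k=j, dims=(2, 3))      # t_j(x)
        logits = w(f(x_j))                          # joint scores over N x M labels
        sla = sla + F.cross_entropy(logits, y * M + j)
        s = s + logits.view(-1, N, M)[:, :, j]      # aggregate class scores (column j)
    sla = sla / M
    p_agg = F.softmax(s / M, dim=1).detach()        # teacher: aggregated prediction

    z = f(x)                                        # single inference on the original x
    log_q = F.log_softmax(u(z), dim=1)              # student: sigma(z; u)
    kl = F.kl_div(log_q, p_agg, reduction='batchmean')  # D_KL(P_aggregated || sigma(z; u))
    return sla + kl + beta * F.cross_entropy(u(z), y)
```

After training, inference can use only $\sigma(f(x); u)$ on the original input, which is where the speed-up over aggregated inference comes from.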