Boosting Few-Shot Visual Learning with Self-Supervision
Contents
- Abstract
- Related Work
- Few-shot Learning
- Self-supervised Learning
- Methodology
- Explored few-shot learning methods
- Boosting few-shot learning via self-supervision
0. Abstract
Few-shot Learning & Self-supervised Learning
- common question : how to train a model with little or no labeled data?
Few-shot Learning :
- how to efficiently learn to recognize patterns in the low-data regime
Self-supervised Learning :
- derives the supervisory signal from the data itself, without annotations
→ this paper exploits both!
1. Related Work
(1) Few-shot Learning
- Gradient descent-based approaches
- Metric learning-based approaches
- Methods learning to map a TEST example to a class label by accessing MEMORY modules that store TRAINING examples
This paper : considers 2 metric learning-based approaches
- (1) Prototypical Network
- (2) Cosine Classifiers
(2) Self-supervised Learning
- annotation-free pretext task
- extracts semantic features that can be useful to other downstream tasks
This paper : considers a multi-task setting,
- train the backbone convnet using joint supervision from the supervised end-task
- and an auxiliary self-supervised pretext task
→ self-supervision as an auxiliary task will bring improvements
2. Methodology
Few-shot learning : 2 learning stages ( & 2 sets of classes )
Notation
- Training set
  - of $N_b$ base classes ( used in 1st stage ) : $D_b=\{(\mathbf{x},y)\}\subset I\times Y_b$
  - of $N_n$ novel classes ( used in 2nd stage ) : $D_n=\{(\mathbf{x},y)\}\subset I\times Y_n$
  - each novel class has $K$ samples ( $K=1$ or $5$ in benchmarks )
  → $N_n$-way $K$-shot learning
- label sets $Y_b$ and $Y_n$ are disjoint
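The $N_n$-way $K$-shot setup above can be sketched as a simple episode sampler ( `sample_episode` and the toy dataset are illustrative names, not from the paper ):

```python
import random

def sample_episode(dataset, n_way, k_shot):
    """Sample an N-way K-shot episode: pick n_way classes, then
    k_shot labeled examples per class, from {label: [examples]}."""
    support_classes = random.sample(sorted(dataset), n_way)
    return {c: random.sample(dataset[c], k_shot) for c in support_classes}

# toy base dataset D_b with 4 base classes, 10 examples each
db = {"cat": list(range(10)), "dog": list(range(10)),
      "bird": list(range(10)), "fish": list(range(10))}
episode = sample_episode(db, n_way=2, k_shot=5)  # a 2-way 5-shot episode
```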
(1) Explored few-shot learning methods
Feature Extractor : $F_\theta(\cdot)$
Prototypical Networks (PN), Cosine Classifiers (CC)
( difference : CC learns actual base classifiers together with the feature extractor, while PN simply relies on class-average prototypes )
a) Prototypical Networks (PN)
[ 1st learning stage ]
- feature extractor $F_\theta(\cdot)$ is learned on sampled few-shot classification sub-problems
- procedure
  - a subset $Y_*\subset Y_b$ of $N_*$ base classes ( = support classes ) is sampled
    - ex) cat, dog, …
  - for each of them, $K$ training examples are randomly picked from within $D_b$
    - ex) (cat1, cat2, … catK), (dog1, dog2, … dogK)
  → training dataset $D_*$
- prototype : average feature for each class $j\in Y_*$
  - $p_j=\frac{1}{K}\sum_{\mathbf{x}\in X_j^*}F_\theta(\mathbf{x})$, with $X_j^*=\{\mathbf{x}\mid(\mathbf{x},y)\in D_*,\ y=j\}$
- build a simple similarity-based classifier from the prototypes
Output ( for input $\mathbf{x}_q$ ) :
- for each class $j$, the normalized classification score is
  $$C_j(F_\theta(\mathbf{x}_q);D_*)=\operatorname{softmax}_j\big[\operatorname{sim}(F_\theta(\mathbf{x}_q),p_i)_{i\in Y_*}\big]$$
Loss function ( of 1st learning stage ) :
$$L_{\text{few}}(\theta;D_b)=\mathbb{E}_{D_*\sim D_b,\,(\mathbf{x}_q,y_q)}\big[-\log C_{y_q}(F_\theta(\mathbf{x}_q);D_*)\big]$$
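The prototype computation and the similarity-based classifier can be sketched in numpy ( cosine similarity stands in for `sim`, and all feature values are made up ):

```python
import numpy as np

def prototypes(support_feats):
    """p_j = average of the K support features of class j."""
    return {j: np.mean(feats, axis=0) for j, feats in support_feats.items()}

def classify(query_feat, protos):
    """Normalized scores: softmax over sim(F(x_q), p_j) for each class j."""
    labels = sorted(protos)
    sims = np.array([np.dot(query_feat, protos[j]) /
                     (np.linalg.norm(query_feat) * np.linalg.norm(protos[j]))
                     for j in labels])
    exp = np.exp(sims - sims.max())          # numerically stable softmax
    return dict(zip(labels, exp / exp.sum()))

# toy episode: 2 support classes, K=3 support features of dimension 4
support = {"cat": np.random.randn(3, 4), "dog": np.random.randn(3, 4)}
scores = classify(np.random.randn(4), prototypes(support))
```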
[ 2nd learning stage ]
- feature extractor is FROZEN
- classifier of novel classes is defined as $C(\cdot;D_n)$
- prototypes computed with the same class-average formula as above, with $D_*=D_n$
b) Cosine Classifiers (CC)
[ 1st learning stage ]
- trains the feature extractor $F_\theta$ together with a cosine-similarity based classifier
- $W_b=[w_1,\ldots,w_{N_b}]$ : matrix of the $d$-dimensional classification weight vectors ( one per base class )
- output : normalized score for image $\mathbf{x}$
  $$C_j(F_\theta(\mathbf{x});W_b)=\operatorname{softmax}_j\big[\gamma\cos(F_\theta(\mathbf{x}),w_i)_{i\in Y_b}\big]$$
  - $\gamma$ : scale factor for the cosine similarity
Loss function ( of 1st learning stage ) :
$$L_{\text{few}}(\theta,W_b;D_b)=\mathbb{E}_{(\mathbf{x},y)\sim D_b}\big[-\log C_y(F_\theta(\mathbf{x});W_b)\big]$$
[ 2nd learning stage ]
- compute one representative feature $w_j$ for each novel class
  - by averaging its $K$ samples in $D_n$
- define the final classifier $C(\cdot;[w_1,\ldots,w_{N_n}])$
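Both CC stages can be sketched in numpy: the 1st stage scores an image against weight vectors with a scaled cosine, and the 2nd stage simply averages the $K$ novel-class features into new weights ( the value of `gamma` and all features here are illustrative ):

```python
import numpy as np

def cosine_scores(feat, weights, gamma=10.0):
    """softmax_j over gamma * cos(F(x), w_j); weights has shape (N, d)."""
    f = feat / np.linalg.norm(feat)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    logits = gamma * (w @ f)                  # scaled cosine similarities
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def novel_weights(novel_feats):
    """2nd stage: w_j = average of the K support features of novel class j.
    novel_feats has shape (N_n, K, d)."""
    return novel_feats.mean(axis=1)

# toy: 5 novel classes, K=5 shots, 8-dimensional features
wn = novel_weights(np.random.randn(5, 5, 8))
probs = cosine_scores(np.random.randn(8), wn)
```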
(2) Boosting few-shot learning via self-supervision
Propose to leverage progress in self-supervised feature learning to improve few-shot learning
[ 1st stage ]
- propose to extend the training of the feature extractor $F_\theta(\cdot)$ by including a self-supervised task
2 ways to incorporate SSL into few-shot learning
- (1) using an auxiliary loss function based on a self-supervised task
- (2) exploiting unlabeled data in a semi-supervised way
a) Auxiliary loss based on self-supervision
Minimize
- by adding an auxiliary self-supervised loss in the 1st stage :
  $$\min_{\theta,[W_b],\phi} L_{\text{few}}(\theta,[W_b];D_b)+\alpha\cdot L_{\text{self}}(\theta,\phi;X_b)$$
- $L_{\text{few}}$ : stands for the PN few-shot loss / CC loss ( the weights $[W_b]$ appear only for CC )
- $X_b$ : the set of training images in $D_b$
Image Rotation
- classes : $\mathcal{R}=\{0^{\circ},90^{\circ},180^{\circ},270^{\circ}\}$
- network $R_\phi$ predicts the rotation class $r$
- $$L_{\text{self}}(\theta,\phi;X)=\mathbb{E}_{\mathbf{x}\sim X}\Big[\sum_{r\in\mathcal{R}}-\log R_\phi^r\big(F_\theta(\mathbf{x}^r)\big)\Big]$$
- $X$ : original training set of non-rotated images
- $R_\phi^r(\cdot)$ : predicted normalized score for rotation $r$
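A sketch of the rotation loss on toy "images" ( the flatten feature function and the linear head standing in for $F_\theta$ and $R_\phi$ are assumptions, not the paper's networks ):

```python
import numpy as np

def rotations(x):
    """The 4 rotated copies x^r, for r in {0, 90, 180, 270} degrees."""
    return [np.rot90(x, k) for k in range(4)]

def rotation_loss(x, feat_fn, head_w):
    """Sum over r of -log R_phi^r(F_theta(x^r)); a linear head head_w
    of shape (4, d) followed by softmax stands in for R_phi."""
    loss = 0.0
    for r, xr in enumerate(rotations(x)):
        logits = head_w @ feat_fn(xr)
        logits = logits - logits.max()
        probs = np.exp(logits) / np.exp(logits).sum()
        loss += -np.log(probs[r])            # true class is the rotation r
    return loss

feat_fn = lambda img: img.flatten()          # trivial stand-in for F_theta
loss = rotation_loss(np.random.randn(4, 4), feat_fn, np.random.randn(4, 16))
```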
Relative patch location
- divide each image into 9 patches over a 3×3 grid
  - $\bar{\mathbf{x}}^0$ : central image patch
  - $\bar{\mathbf{x}}^1,\ldots,\bar{\mathbf{x}}^8$ : its 8 neighbors
- compute the representation of each patch
  & generate patch feature pairs $(F_\theta(\bar{\mathbf{x}}^0),F_\theta(\bar{\mathbf{x}}^p))$ by concatenation
- $$L_{\text{self}}(\theta,\phi;X)=\mathbb{E}_{\mathbf{x}\sim X}\Big[\sum_{p=1}^{8}-\log P_\phi^p\big(F_\theta(\bar{\mathbf{x}}^0),F_\theta(\bar{\mathbf{x}}^p)\big)\Big]$$
- $X$ : original training set of images
- $P_\phi^p$ : predicted normalized score for relative location $p$
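The patch extraction and pairing steps can be sketched as follows ( the flatten feature function stands in for $F_\theta$; the real method crops patches with gaps and jitter, which is omitted here ):

```python
import numpy as np

def patches_3x3(img):
    """Split a square image into 9 patches over a 3x3 grid;
    returns (central patch x^0, [8 neighbor patches x^1..x^8])."""
    s = img.shape[0] // 3
    grid = [img[i*s:(i+1)*s, j*s:(j+1)*s] for i in range(3) for j in range(3)]
    center = grid.pop(4)                     # index 4 = middle of the 3x3 grid
    return center, grid

def patch_pairs(img, feat_fn):
    """Concatenated feature pairs (F(x^0), F(x^p)) for the 8 neighbors p."""
    center, neighbors = patches_3x3(img)
    f0 = feat_fn(center)
    return [np.concatenate([f0, feat_fn(p)]) for p in neighbors]

feat_fn = lambda patch: patch.flatten()      # trivial stand-in for F_theta
pairs = patch_pairs(np.random.randn(9, 9), feat_fn)
```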
b) Semi-supervised few-shot learning
- the self-supervised loss does not depend on class labels
- → can also obtain information from additional unlabeled images $X_u$
- $$\min_{\theta,[W_b],\phi} L_{\text{few}}(\theta,[W_b];D_b)+\alpha\cdot L_{\text{self}}(\theta,\phi;X_b\cup X_u)$$
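The semi-supervised objective is just a weighted sum in which the self-supervised term sees both the labeled and the unlabeled images; a sketch with placeholder loss functions ( the dummy losses below only count their inputs, purely for illustration ):

```python
def total_loss(few_loss, self_loss, labeled, unlabeled, alpha=1.0):
    """L_few on labeled data + alpha * L_self on labeled + unlabeled images."""
    images = [x for x, _ in labeled] + unlabeled   # X_b union X_u
    return few_loss(labeled) + alpha * self_loss(images)

# toy check with dummy losses that just count their inputs
labeled = [("img%d" % i, i % 2) for i in range(6)]   # (image, label) pairs
unlabeled = ["u%d" % i for i in range(4)]
loss = total_loss(lambda d: float(len(d)), lambda x: float(len(x)),
                  labeled, unlabeled, alpha=0.5)     # 6 + 0.5 * 10 = 11.0
```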