Boosting Few-Shot Visual Learning with Self-Supervision


Contents

  1. Abstract
  2. Related Work
    1. Few-shot Learning
    2. Self-supervised Learning
  3. Methodology
    1. Explored few-shot learning methods
    2. Boosting few-shot learning via self-supervision


0. Abstract

Few-shot Learning & Self-supervised Learning

  • common question : how to train a model with little or no labeled data?


Few-shot Learning :

  • how to efficiently learn to recognize patterns in the low-data regime

Self-supervised Learning :

  • derives the supervisory signal from the data itself, without annotations

this paper exploits both!


1. Related Work

(1) Few-shot Learning

  1. Gradient descent-based approach
  2. Metric learning based approach
  3. Methods learning to map a TEST example to a class label by accessing MEMORY modules that store TRAINING examples


This paper : considers 2 metric learning-based approaches

  • (1) Prototypical Network
  • (2) Cosine Classifiers


(2) Self-supervised Learning

  • annotation-free pretext task
  • extracts semantic features that can be useful to other downstream tasks


This paper : considers a multi-task setting,

  • train the backbone convnet using joint supervision from the supervised end-task
  • and an auxiliary self-supervised pretext task


self-supervision as an auxiliary task will bring improvements


2. Methodology

Few-shot learning : 2 learning stages ( & 2 sets of classes )


Notation

  • Training set

    • of base classes ( used in 1st stage ) : D_b=\{(\mathbf{x}, y)\} \subset I \times Y_b
    • of N_n novel classes ( used in 2nd stage ) : D_n=\{(\mathbf{x}, y)\} \subset I \times Y_n
      • each class has K samples ( K=1 or 5 in benchmarks )

    \Rightarrow N_n-way K-shot learning ( episode sampling sketched below )

  • label sets Y_n and Y_b are disjoint
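
A minimal sketch of how one such N_n-way K-shot episode could be sampled ( my own illustration, not the authors' code; `dataset` is assumed to be a dict mapping each class label to its list of images ) :

```python
import random

def sample_episode(dataset, n_way=5, k_shot=5, n_query=15):
    """Sample the support / query sets of one N-way K-shot episode."""
    classes = random.sample(list(dataset), n_way)   # support classes Y ⊂ Y_b
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        examples = random.sample(dataset[cls], k_shot + n_query)
        support += [(x, episode_label) for x in examples[:k_shot]]
        query += [(x, episode_label) for x in examples[k_shot:]]
    return support, query
```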


(1) Explored few-shot learning methods

Feature Extractor : F_\theta(\cdot)

  • Prototypical Network (PN), Cosine Classifiers (CC)

    ( difference : CC learns actual base classifiers together with the feature extractor, while PN simply relies on class averages )


a) Prototypical Network (PN)

[ 1st learning stage ]

  • feature extractor F_\theta(\cdot) is learned on sampled few-shot classification sub-problems

  • procedure

    • a subset Y \subset Y_b of N base classes ( = support classes ) is sampled

      • ex) cat, dog, ….
    • for each of them, K training examples are randomly picked from within Db

      • ex) (cat1, cat2, .. catK) , (dog1, dog2, … dogK)

      training dataset D

  • prototype : average feature for each class j \in Y

    • p_j=\frac{1}{K} \sum_{\mathbf{x} \in X_j} F_\theta(\mathbf{x}), with X_j=\{\mathbf{x} \mid (\mathbf{x}, y) \in D, y=j\}
  • build a simple similarity-based classifier with the prototypes


Output ( for input \mathbf{x}_q ) :

  • for each class j , the normalized classification score is

    C_j(F_\theta(\mathbf{x}_q); D)=\operatorname{softmax}_j\left[\operatorname{sim}(F_\theta(\mathbf{x}_q), p_i)_{i \in Y}\right].


Loss function ( of 1st learning stage ) :

  • L_{\mathrm{few}}(\theta; D_b)=\underset{D \sim D_b,\ (\mathbf{x}_q, y_q)}{\mathbb{E}}\left[-\log C_{y_q}(F_\theta(\mathbf{x}_q); D)\right].
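
Putting the formulas together, a minimal PyTorch sketch of the 1st-stage PN loss ( an illustration under my own assumptions, not the authors' code; `f_theta` is any feature extractor, and negative squared Euclidean distance is used as sim, one common choice ) :

```python
import torch
import torch.nn.functional as F

def prototypes(f_theta, support_x, support_y, n_way):
    feats = f_theta(support_x)                          # [N*K, d]
    return torch.stack([feats[support_y == j].mean(0)   # p_j : class average
                        for j in range(n_way)])         # [N, d]

def pn_loss(f_theta, support_x, support_y, query_x, query_y, n_way):
    protos = prototypes(f_theta, support_x, support_y, n_way)
    q = f_theta(query_x)                                # F_theta(x_q), [Q, d]
    scores = -torch.cdist(q, protos) ** 2               # sim = negative squared distance
    return F.cross_entropy(scores, query_y)             # E[ -log C_{y_q}(F_theta(x_q); D) ]
```

`F.cross_entropy` applies the softmax over classes internally, matching the normalized score C_j above; in the 2nd stage the same `prototypes` function is reused on D_n with F_\theta frozen.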


[ 2nd learning stage ]

  • feature extractor is FROZEN
  • classifier of novel classes is defined as C(\cdot; D_n)
    • prototypes computed as above ( p_j=\frac{1}{K} \sum_{\mathbf{x} \in X_j} F_\theta(\mathbf{x}) ) with D=D_n.


b) Cosine Classifiers (CC)

[ 1st learning stage ]

  • trains the feature extractor F_\theta together with a cosine-similarity based classifier
  • W_b=[\mathbf{w}_1, \ldots, \mathbf{w}_{N_b}] : matrix of the d-dimensional classification weight vectors ( one per base class )
  • output : normalized score for image \mathbf{x}
    • C_j(F_\theta(\mathbf{x}); W_b)=\operatorname{softmax}_j\left[\gamma \cos(F_\theta(\mathbf{x}), \mathbf{w}_i)_{i \in Y_b}\right], with scale factor \gamma.


Loss function ( of 1st learning stage ) :

  • L_{\mathrm{few}}(\theta, W_b; D_b)=\underset{(\mathbf{x}, y) \sim D_b}{\mathbb{E}}\left[-\log C_y(F_\theta(\mathbf{x}); W_b)\right].
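
A minimal PyTorch sketch of the cosine classifier and its 1st-stage loss ( illustration only; the weight initialization and default value of γ are my assumptions ) :

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    def __init__(self, feat_dim, n_base, gamma=10.0):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_base, feat_dim) * 0.01)  # W_b = [w_1, ..., w_Nb]
        self.gamma = nn.Parameter(torch.tensor(gamma))               # scale factor γ

    def forward(self, feats):
        # γ · cos(F_theta(x), w_i) for every base class i
        cos = F.normalize(feats, dim=1) @ F.normalize(self.W, dim=1).T
        return self.gamma * cos                                      # [B, N_b] logits

def cc_loss(f_theta, classifier, x, y):
    return F.cross_entropy(classifier(f_theta(x)), y)  # E[ -log C_y(F_theta(x); W_b) ]

# 2nd stage ( sketch ): the weight of a novel class = average feature of its
# K samples, exactly like a PN prototype; the feature extractor stays frozen.
```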


[ 2nd learning stage ]

  • compute one representative feature \mathbf{w}_j for each novel class
    • by averaging the features of its K samples in D_n
  • define the final classifier C(\cdot; [\mathbf{w}_1, \ldots, \mathbf{w}_{N_n}])


(2) Boosting few-shot learning via self-supervision

[ Figure 2 ]


Propose to leverage progress in self-supervised feature learning to improve few-shot learning


[ 1st stage ]

  • propose to extend the training of the feature extractor F_\theta(\cdot)

    by including a self-supervised task


2 ways to incorporate SSL into few-shot learning

  • (1) using an auxiliary loss function, based on self-supervised task
  • (2) exploiting unlabeled data in a semi-supervised way


a) Auxiliary loss based on self-supervision

\min_{\theta, [W_b], \phi} L_{\mathrm{few}}(\theta, [W_b]; D_b) + \alpha \cdot L_{\mathrm{self}}(\theta, \phi; X_b).

  • by adding an auxiliary self-supervised loss to the 1st learning stage

  • L_{\mathrm{few}} : stands for the PN few-shot loss / CC loss ( the bracketed [W_b] appears only for CC )
  • L_{\mathrm{self}} : self-supervised loss over X_b , the set of training images in D_b
  • \alpha : weight of the auxiliary loss
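
A sketch of one joint training step for this objective ( `few_loss_fn` / `self_loss_fn` are hypothetical closures computing L_few and L_self on the current batch ) :

```python
def joint_step(optimizer, few_loss_fn, self_loss_fn, alpha=1.0):
    """One update of min L_few + alpha * L_self over theta, [W_b], phi."""
    loss = few_loss_fn() + alpha * self_loss_fn()  # both terms share F_theta
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```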


Image Rotation

  • classes : \mathcal{R}=\left\{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\right\}
  • network R_\phi predicts the rotation class r
  • L_{\text {self }}(\theta, \phi ; X)=\underset{\mathbf{x} \sim X}{\mathbb{E}}\left[\sum_{\forall r \in \mathcal{R}}-\log R_\phi^r\left(F_\theta\left(\mathbf{x}^r\right)\right)\right].
    • X : original training set of non-rotated images
    • R_\phi^r(\cdot) : predicted normalized score for rotation r
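
A minimal PyTorch sketch of this rotation loss ( illustration only; `r_phi` is a small 4-way classification head on top of the features ) :

```python
import torch
import torch.nn.functional as F

def rotation_loss(f_theta, r_phi, x):
    """L_self for image rotations; x is a batch of shape [B, C, H, W]."""
    losses = []
    for r in range(4):                                 # r ∈ {0°, 90°, 180°, 270°}
        x_r = torch.rot90(x, k=r, dims=(2, 3))         # rotated copy x^r
        logits = r_phi(f_theta(x_r))                   # R_phi(F_theta(x^r))
        target = torch.full((x.size(0),), r,
                            dtype=torch.long, device=x.device)
        losses.append(F.cross_entropy(logits, target)) # -log R_phi^r(...)
    return sum(losses)                                 # sum over all r ∈ R
```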


Relative patch location

  • divide the image into 9 regions over a 3x3 grid

    • \overline{\mathrm{x}}^0 : central image patch
    • \overline{\mathrm{x}}^1 \cdots \overline{\mathrm{x}}^8 : 8 neighbors
  • compute the representation of each patch

    & generate patch feature pairs \left(F_\theta\left(\overline{\mathbf{x}}^0\right), F_\theta\left(\overline{\mathbf{x}}^p\right)\right) by concatenation.

  • L_{\mathrm{self}}(\theta, \phi ; X)=\underset{\mathbf{x} \sim X}{\mathbb{E}}\left[\sum_{p=1}^8-\log P_\phi^p\left(F_\theta\left(\overline{\mathbf{x}}^0\right), F_\theta\left(\overline{\mathbf{x}}^p\right)\right)\right].

    • X : original training set of images
    • P_\phi^p : predicted normalized score for relative location p.
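
A minimal PyTorch sketch of this patch-location loss ( illustration only; `p_phi` is an 8-way head over the concatenated pair feature, and `patches` is assumed to hold the 9 pre-cropped patches with the central one at index 0 ) :

```python
import torch
import torch.nn.functional as F

def patch_location_loss(f_theta, p_phi, patches):
    """L_self for relative patch location; patches has shape [B, 9, C, h, w]."""
    b = patches.size(0)
    center = f_theta(patches[:, 0])                       # F_theta(x̄^0)
    losses = []
    for p in range(1, 9):                                 # the 8 neighbors x̄^1 .. x̄^8
        neighbor = f_theta(patches[:, p])                 # F_theta(x̄^p)
        pair = torch.cat([center, neighbor], dim=1)       # concatenated pair feature
        target = torch.full((b,), p - 1,
                            dtype=torch.long, device=patches.device)
        losses.append(F.cross_entropy(p_phi(pair), target))  # -log P_phi^p(...)
    return sum(losses)
```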


b) Semi-supervised few-shot learning

  • the self-supervised loss does not depend on class labels
  • so it can also obtain information from additional unlabeled data X_u
  • \min _{\theta,\left[W_b\right], \phi} L_{\mathrm{few}}\left(\theta,\left[W_b\right] ; D_b\right)+\alpha \cdot L_{\mathrm{self}}\left(\theta, \phi ; X_b \cup X_u\right).
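
A sketch of how the unlabeled images enter only the self-supervised term ( reusing the hypothetical `rotation_loss` from the sketch above; `x_b` are labeled base-class images, `x_u` unlabeled ones ) :

```python
import torch

def semi_supervised_loss(few_loss, f_theta, r_phi, x_b, x_u, alpha=1.0):
    # L_few only ever sees the labeled data (already computed as `few_loss`);
    # L_self is computed over labeled and unlabeled images together.
    x_all = torch.cat([x_b, x_u], dim=0)      # X_b ∪ X_u
    return few_loss + alpha * rotation_loss(f_theta, r_phi, x_all)
```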
