Self-supervised Label Augmentation via Input Transformations


Contents

  0. Abstract
  1. Introduction
  2. Self-supervised Label Augmentation (SLA)
    (1) Multi-task Learning with Self-supervision
    (2) Eliminating Invariance via Joint-label Classifier


0. Abstract

self-supervised learning

  • constructs artificial labels


This paper :

  • constructing artificial labels works well even on fully labeled datasets

  • main idea : learn a single unified task w.r.t. the joint distribution of the (1) original labels & (2) self-supervised labels

    ( i.e., augment these 2 kinds of labels )

  • proposes a novel knowledge transfer technique ( = self-distillation )

    → faster inference!


1. Introduction

Contribution

  1. multi-task approach :

    • enforcing invariance to the transformations

      may lead to bad results in some cases

  2. propose a simple & effective algorithm

    • learn a single unified task
    • use the joint distribution of the original & self-supervised labels


Self-supervised Label Augmentation (SLA)

  • proposed label augmentation method

  • does not force any invariance to the transformations

  • assigns different labels for each transformation

    → possible to make a prediction by aggregation

    ( acts as an ensemble )

  • proposes a self-distillation technique

    ( transfers the knowledge of the multiple inferences into a single inference )


2. Self-supervised Label Augmentation (SLA)

( setting : fully-supervised scenario )

  1. problem of the conventional multi-task learning approach

  2. introduce the proposed algorithm

    ( + 2 additional techniques : (1) aggregation & (2) self-distillation )

    • (1) aggregation :

      • uses all differently augmented samples

      • provide an ensemble effect, using a single model

    • (2) self-distillation :

      • transfers the aggregated knowledge into the model itself for acceleration


Notation

  • input : $x \in \mathbb{R}^d$
  • number of classes : $N$
  • cross-entropy loss : $\mathcal{L}_{\text{CE}}$
  • softmax classifier : $\sigma(\cdot\,; u)$
    • $\sigma_i(z; u) = \exp(u_i^\top z) / \sum_k \exp(u_k^\top z)$
  • embedding vector : $z = f(x; \theta)$
  • augmented sample : $\tilde{x} = t(x)$
  • embedding vector of augmented sample : $\tilde{z} = f(\tilde{x}; \theta)$
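
To make the notation concrete, here is a minimal PyTorch sketch ( an illustrative assumption : `f` below is a toy linear embedding, whereas the paper's experiments use deep CNNs; rotations are the paper's default transformations ) :

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, M, d = 10, 4, 512                    # classes, transformations, embedding dim

f = nn.Linear(3 * 32 * 32, d)           # toy stand-in for the embedding network f(.; theta)
u = nn.Linear(d, N, bias=False)         # primary softmax classifier sigma(.; u)

def rotate4(x):
    # t_1..t_M : rotations by 0/90/180/270 degrees (the paper's default choice)
    return [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]

x = torch.randn(8, 3, 32, 32)           # toy batch
z = f(x.flatten(1))                     # z = f(x; theta)
z_tilde = [f(t.flatten(1)) for t in rotate4(x)]   # z~_j = f(x~_j; theta)
```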


( Figure 2 )


(1) Multi-task Learning with Self-supervision

transformation-based self-supervised learning

  • learns to predict which transformation was applied
  • usually uses 2 losses ( multi-task learning )


Loss function :

$$\mathcal{L}_{\text{MT}}(x, y; \theta, u, v) = \frac{1}{M} \sum_{j=1}^{M} \left[ \mathcal{L}_{\text{CE}}(\sigma(\tilde{z}_j; u), y) + \mathcal{L}_{\text{CE}}(\sigma(\tilde{z}_j; v), j) \right]$$

  • $\{t_j\}_{j=1}^{M}$ : pre-defined transformations
  • $\tilde{x}_j = t_j(x)$ : sample transformed by $t_j$
  • $\tilde{z}_j = f(\tilde{x}_j; \theta)$ : embedding of the transformed sample
  • $\sigma(\cdot\,; u)$ : classifier ( for the primary task )
  • $\sigma(\cdot\,; v)$ : classifier ( for the self-supervised task )

→ forces the primary classifier to be invariant to the transformations ( see the sketch below )
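
A hedged sketch of $\mathcal{L}_{\text{MT}}$, reusing `f`, `u`, and `rotate4` from the notation sketch above ( `v` and `loss_mt` are illustrative names, not the paper's reference code ) :

```python
v = nn.Linear(d, M, bias=False)   # self-supervised classifier sigma(.; v)

def loss_mt(x, y):
    # L_MT : average over the M transformations of (primary CE + self-supervised CE)
    loss = 0.0
    for j, t in enumerate(rotate4(x)):
        z_j = f(t.flatten(1))                                         # z~_j
        loss = loss + F.cross_entropy(u(z_j), y)                      # primary label y
        loss = loss + F.cross_entropy(v(z_j), torch.full_like(y, j))  # which t_j?
    return loss / M

y = torch.randint(0, N, (8,))     # toy labels
print(loss_mt(x, y))
```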


LIMITATION

  • depending on the type of transformation…

    might hurt performance!

  • ex) rotation : the digits 6 & 9 become confusable


(2) Eliminating Invariance via Joint-label Classifier

  • remove the unnecessary INVARIANT property of the classifier $\sigma(f(\cdot); u)$

  • instead, use a JOINT softmax classifier $\rho(\cdot\,; w)$

    • joint probability : $P(i, j \mid \tilde{x}) = \rho_{ij}(\tilde{z}; w) = \exp(w_{ij}^\top \tilde{z}) / \sum_{k,l} \exp(w_{kl}^\top \tilde{z})$

  • Loss function : $\mathcal{L}_{\text{SLA}}(x, y; \theta, w) = \frac{1}{M} \sum_{j=1}^{M} \mathcal{L}_{\text{CE}}(\rho(\tilde{z}_j; w), (y, j))$

    • where $\mathcal{L}_{\text{CE}}(\rho(\tilde{z}; w), (i, j)) = -\log \rho_{ij}(\tilde{z}; w)$

  • only increases the number of labels

    ( the number of additional parameters is very small )

  • $\mathcal{L}_{\text{MT}}$ and $\mathcal{L}_{\text{SLA}}$ consider the same set of multi-labels ( see the sketch below )
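
A minimal sketch of $\mathcal{L}_{\text{SLA}}$ under the same toy setup ( the single joint head `w` and the label-flattening convention `y * M + j` are illustrative assumptions ) :

```python
w = nn.Linear(d, N * M, bias=False)   # joint classifier rho(.; w) over N*M labels

def loss_sla(x, y):
    # L_SLA : cross-entropy against the joint label (y, j), flattened as y*M + j
    loss = 0.0
    for j, t in enumerate(rotate4(x)):
        z_j = f(t.flatten(1))
        loss = loss + F.cross_entropy(w(z_j), y * M + j)
    return loss / M
```

Note how, compared to `loss_mt`, only the label space changes : one softmax over $N \times M$ joint labels replaces the two separate heads.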


Aggregated Inference

  • do not need to consider all $N \times M$ labels at test time

    ( because we already know which transformation was applied )

  • make a prediction using the conditional probability

    • $P(i \mid \tilde{x}_j, j) = \exp(w_{ij}^\top \tilde{z}_j) / \sum_k \exp(w_{kj}^\top \tilde{z}_j)$

  • for all possible transformations…

    • aggregate the conditional probabilities!
    • acts as an ensemble model
    • $P_{\text{aggregated}}(i \mid x) = \exp(s_i) / \sum_{k=1}^{N} \exp(s_k)$, where $s_i = \frac{1}{M} \sum_{j=1}^{M} w_{ij}^\top \tilde{z}_j$ is the logit averaged over all $M$ transformations
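
A sketch of the aggregated inference, continuing the toy setup above ( the `(N, M)` reshape must match the flattening convention used in `loss_sla` ) :

```python
@torch.no_grad()
def predict_aggregated(x):
    # P_aggregated(i|x) : softmax over s_i = (1/M) * sum_j w_ij^T z~_j
    s = 0.0
    for j, t in enumerate(rotate4(x)):
        z_j = f(t.flatten(1))
        logits = w(z_j).view(-1, N, M)   # entry [b, i, j'] = w_ij'^T z~_j
        s = s + logits[:, :, j]          # keep only the column of the applied t_j
    return F.softmax(s / M, dim=1)
```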


Self-distillation from aggregation

  • requires computing $\tilde{z}_j = f(\tilde{x}_j)$ for all $j$

    → $M$ times higher computation cost than a single inference

  • Solution : perform self-distillation

    • from ) the aggregated knowledge $P_{\text{aggregated}}(\cdot \mid x)$

    • to ) another classifier $\sigma(f(x; \theta); u)$

      → can maintain the aggregated knowledge, using only $z = f(x)$

  • objective function :

    $$\mathcal{L}_{\text{SLA+SD}}(x, y; \theta, w, u) = \mathcal{L}_{\text{SLA}}(x, y; \theta, w) + D_{\text{KL}}\big(P_{\text{aggregated}}(\cdot \mid x) \,\big\|\, \sigma(z; u)\big) + \beta\, \mathcal{L}_{\text{CE}}(\sigma(z; u), y)$$
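
A sketch of the full objective, combining the pieces above ( `beta = 1.0` is an assumed placeholder for the hyperparameter $\beta$ ) :

```python
beta = 1.0   # weight of the extra CE term (assumed value; a tunable hyperparameter)

def loss_sla_sd(x, y):
    # L_SLA+SD = L_SLA + KL(P_aggregated || sigma(z; u)) + beta * CE(sigma(z; u), y)
    p_agg = predict_aggregated(x)          # teacher : aggregated prediction (no grad)
    logits = u(f(x.flatten(1)))            # student : single-inference sigma(z; u)
    kl = F.kl_div(F.log_softmax(logits, dim=1), p_agg, reduction="batchmean")
    return loss_sla(x, y) + kl + beta * F.cross_entropy(logits, y)
```

At test time, only the single-inference path $\sigma(f(x); u)$ is used, which is where the faster inference comes from.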
