Selfie : Self-supervised Pretraining for Image Embedding


Contents

  1. Abstract
  2. Method
    1. Pretraining Details
    2. Attention Pooling


0. Abstract

introduce a pretraining technique called Selfie

( = SELF-supervised Image Embedding )

  • generalizes the concept of masked language modeling of BERT to images
  • learns to select the correct patch to fill in a masked location ( among distractor patches from the same image )


1. Method

2 stages

  • (1) pre-training
  • (2) fine-tuning


(1) pre-training stage

  • P : patch processing network

  • produces 1 feature vector per patch ( used by both ENC & DEC )

  • Encoder

    • feature vectors are pooled by the attention pooling network A

      → produces a single vector u

  • Decoder

    • no pooling
    • feature vectors are sent directly to the loss computation
  • Encoder & Decoder : jointly trained ( see the sketch below )
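A minimal end-to-end sketch of this pre-training step in PyTorch. The module names (P, A), the toy linear patch network, and the mean-pooling placeholder for A are my own simplifications; the real P is a ResNet stack and A is the Transformer-based attention pooling described in section (2).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# toy stand-ins (names and sizes are mine, not the paper's code)
patch_dim, d = 3 * 8 * 8, 128                    # flattened 8x8 RGB patch -> d-dim feature
P = nn.Linear(patch_dim, d)                      # patch processing network P (stand-in for ResNet blocks)
A = lambda h: h.mean(dim=1)                      # placeholder for attention pooling A (see section (2))

B, n_visible, n_masked = 2, 6, 3
visible = torch.randn(B, n_visible, patch_dim)   # patches sent to the Encoder
masked  = torch.randn(B, n_masked, patch_dim)    # patches sent to the Decoder

u = A(P(visible))                                # Encoder : one pooled summary vector u, shape (B, d)
v = P(masked)                                    # Decoder : one feature vector per patch, no pooling

# pick the correct patch for one masked location among the candidates
# (the paper builds the query by adding a location embedding to u; omitted here)
logits = torch.einsum('bd,bkd->bk', u, v)        # similarity of u to each candidate patch
loss = F.cross_entropy(logits, torch.zeros(B, dtype=torch.long))   # pretend index 0 is the true patch
loss.backward()                                  # gradients reach both paths -> Encoder & Decoder trained jointly
```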


(2) fine-tuning stage

  • goal : improve ResNet-50

    pretrain the first 3 blocks of this architecture ( = P ), then fine-tune the full ResNet-50 starting from the pretrained weights

(1) Pretraining Details

  • use a part of the input image to predict the rest of the image

Figure 2 example :

  • ex) Patch 1,2,5,6,7,9 : sent to Encoder
  • ex) Patch 3,4,8 : sent to Decoder
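The split in this example is just a random partition of the patch indices; a tiny sketch of how such a mask could be drawn (the 3x3 grid and 1-based indices only mirror the figure):

```python
import random

n_patches, n_masked = 9, 3                            # 3x3 grid, as in the example
indices = list(range(1, n_patches + 1))               # 1-based to match the figure
decoder_patches = sorted(random.sample(indices, n_masked))          # e.g. [3, 4, 8]
encoder_patches = [i for i in indices if i not in decoder_patches]  # e.g. [1, 2, 5, 6, 7, 9]
print("Encoder :", encoder_patches)
print("Decoder :", decoder_patches)
```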


a) Patch Sampling method

  • image size 32x32 → patch size 8x8
  • image size 224x224 → patch size 32x32
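One simple way to cut an image into these non-overlapping patches is `unfold` in PyTorch; this is my own implementation choice, not something specified in the paper:

```python
import torch

def to_patches(img, p):
    """img: (C, H, W) -> (num_patches, C, p, p), non-overlapping p x p patches."""
    c, _, _ = img.shape
    patches = img.unfold(1, p, p).unfold(2, p, p)      # (C, H/p, W/p, p, p)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c, p, p)

print(to_patches(torch.randn(3, 32, 32), 8).shape)     # torch.Size([16, 3, 8, 8])
print(to_patches(torch.randn(3, 224, 224), 32).shape)  # torch.Size([49, 3, 32, 32])
```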


b) Patch processing network

focus on improving ResNet-50

use it as the patch processing network P
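A possible way to realize "the first 3 blocks of ResNet-50 as P" with torchvision; treating the stem plus `layer1`–`layer3` as those 3 blocks is my interpretation of the paper's description:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

resnet = resnet50(weights=None)              # random init; the weights come from the pre-training stage
P = nn.Sequential(                           # stem + first 3 residual stages = P
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3,
)

feat = P(torch.randn(1, 3, 32, 32))          # feature map for one 32x32 patch/image
vec = feat.mean(dim=(2, 3))                  # global average pool -> 1 feature vector per patch
print(feat.shape, vec.shape)                 # torch.Size([1, 1024, 2, 2]) torch.Size([1, 1024])
```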


c) Efficient implementation of mask prediction

for efficiency … the decoder is implemented to predict the correct patches for multiple locations at the same time
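This batched prediction can be written as a single matrix multiplication between location-specific query vectors and all candidate patch features, with the correct patch on the diagonal. A sketch under that assumption (tensor names and sizes are made up):

```python
import torch
import torch.nn.functional as F

B, n_masked, d = 2, 3, 128
u = torch.randn(B, d)                        # pooled Encoder output
loc_emb = torch.randn(n_masked, d)           # embeddings of the masked locations
v = torch.randn(B, n_masked, d)              # Decoder features of the candidate patches

queries = u.unsqueeze(1) + loc_emb           # (B, n_masked, d) : one query per masked location
logits = queries @ v.transpose(1, 2)         # (B, n_masked, n_masked) : every query vs. every candidate

# candidate j is the true patch for location j -> correct answers lie on the diagonal
targets = torch.arange(n_masked).expand(B, -1)
loss = F.cross_entropy(logits.reshape(-1, n_masked), targets.reshape(-1))
```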


(2) Attention Pooling

attention pooling network : A


a) Transformer as pooling operation

notation

  • patch processing network : P

  • input vectors : $\{h_1, h_2, \dots, h_n\}$

    pool them into a single vector $u$


attention pooling

  • $u, h^{\text{output}}_1, h^{\text{output}}_2, \dots, h^{\text{output}}_n = \text{TransformerLayers}(u_0, h_1, h_2, \dots, h_n)$

    ( use only $u$ as the pooling result! )
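A sketch of this pooling: prepend a learnable vector $u_0$ to the input features, run Transformer layers over the whole set, and keep only the first output as $u$. The layer count, head count, and dimension below are arbitrary choices of mine:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Pools {h_1, ..., h_n} into a single vector u with Transformer layers."""
    def __init__(self, d=128, nhead=4, num_layers=2):
        super().__init__()
        self.u0 = nn.Parameter(torch.randn(1, 1, d))               # learnable seed vector u_0
        layer = nn.TransformerEncoderLayer(d, nhead, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers)
    def forward(self, h):                                          # h : (B, n, d)
        x = torch.cat([self.u0.expand(h.size(0), -1, -1), h], dim=1)
        out = self.layers(x)                                       # (B, n + 1, d)
        return out[:, 0]                                           # keep only u, discard the h_i outputs

u = AttentionPooling()(torch.randn(2, 16, 128))                    # 16 patch features -> one u
print(u.shape)                                                     # torch.Size([2, 128])
```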


b) Positional embedding

( image size 32x32 ) : 16 patches ( of size 8x8 )

( image size 224x224 ) : 49 patches ( of size 32x32 )

instead of learning 49 positional embeddings … decompose each position into a row and a column component → only need to learn 7 + 7 ( = 14 ) embeddings
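Concretely, each of the 49 positions gets the sum of one of 7 row embeddings and one of 7 column embeddings; a small sketch (the embedding dimension is arbitrary):

```python
import torch
import torch.nn as nn

grid, d = 7, 128                                    # 7x7 = 49 patch positions; d is arbitrary
row_emb = nn.Embedding(grid, d)                     # 7 learned row embeddings
col_emb = nn.Embedding(grid, d)                     # 7 learned column embeddings ( 7 + 7 = 14 total )

rows = torch.arange(grid).repeat_interleave(grid)   # 0,0,...,0, 1,1,...  -> (49,)
cols = torch.arange(grid).repeat(grid)              # 0,1,...,6, 0,1,...  -> (49,)
pos_emb = row_emb(rows) + col_emb(cols)             # (49, d) : one embedding per position
print(pos_emb.shape)
```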
