Selfie : Self-supervised Pretraining for Image Embedding


Contents

  1. Abstract
  2. Method
    1. Pretraining Details
    2. Attention Pooling


0. Abstract

introduce a pretraining technique called Selfie

( = SELF-supervised Image Embedding )

  • generalizes the concept of masked language modeling of BERT to images
  • learns to select the correct patch ( among distractor patches from the same image ) to fill in masked locations


1. Method

2 stages

  • (1) pre-training
  • (2) fine-tuning


(1) pre-training stage

  • \(P\) : patch processing network

  • processes each patch independently \(\rightarrow\) produces 1 feature vector per patch ( for both ENC & DEC )

  • Encoder

    • feature vectors are pooled by attention pooling network \(A\)

      \(\rightarrow\) produces a single vector \(u\)

  • Decoder

    • no pooling
    • feature vectors are sent directly to the loss computation
  • Encoder & Decoder : jointly trained
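
A minimal sketch of this pre-training data flow, with toy stand-ins for \(P\) and \(A\) ( the dimensions, the patch split, and the simple mean-pooling at the end are illustrative assumptions, not the paper's code ):

```python
import torch
import torch.nn as nn

d = 256                                                    # feature dimension (assumed)
P = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, d))   # toy patch processing network
A = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)  # stands in for attention pooling

patches = torch.randn(16, 3, 8, 8)                 # 16 patches (8x8) of one 32x32 image
feats = P(patches)                                 # one feature vector per patch, shared by ENC & DEC

enc_idx, dec_idx = [0, 1, 4, 5, 6, 8], [2, 3, 7]   # e.g. patches 1,2,5,6,7,9 vs 3,4,8 (0-based)
u = A(feats[enc_idx].unsqueeze(0)).mean(1)         # Encoder: pool to a single vector u (pooling simplified here)
dec_feats = feats[dec_idx]                         # Decoder: no pooling, used directly in the loss
```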


(2) fine-tuning stage

  • goal : improve ResNet-50

    \(\rightarrow\) pretrain the first 3 blocks of this architecture ( = \(P\) ), then fine-tune the full ResNet-50 on labeled data
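
A rough sketch of this fine-tuning setup, assuming torchvision's ResNet-50 and reading "the first 3 blocks" as the stem plus layer1 ~ layer3 ( an interpretation, not the authors' code ):

```python
import torch.nn as nn
from torchvision.models import resnet50

net = resnet50(num_classes=10)              # target task with 10 classes (illustrative)

# P: the first 3 blocks, whose weights would come from the pretraining stage
P = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool,
                  net.layer1, net.layer2, net.layer3)
# remaining block + classifier head, trained during fine-tuning
head = nn.Sequential(net.layer4, net.avgpool, nn.Flatten(), net.fc)
model = nn.Sequential(P, head)              # full ResNet-50, fine-tuned end to end
```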

(1) Pretraining Details

  • use a part of the input image to predict the rest of the image

Figure 2 ( example patch split )

  • ex) Patch 1,2,5,6,7,9 : sent to Encoder
  • ex) Patch 3,4,8 : sent to Decoder
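
The split above follows the paper's Figure 2; in general the masked positions are chosen at random. A tiny illustrative sketch ( the patch count and mask count below are assumptions ):

```python
import random

n_patches, n_masked = 9, 3                                    # as in the example above
dec_idx = sorted(random.sample(range(n_patches), n_masked))   # patches held out for the Decoder
enc_idx = [i for i in range(n_patches) if i not in dec_idx]   # patches the Encoder gets to see
```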


a) Patch Sampling method

image size 32x32 \(\rightarrow\) patch size = 8x8

image size 224x224 \(\rightarrow\) patch size = 32x32
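
A small sketch of cutting an image into non-overlapping patches of these sizes with plain tensor ops ( not the paper's pipeline ):

```python
import torch

img = torch.randn(3, 32, 32)                     # a 32x32 image -> 8x8 patches
p = 8
patches = img.unfold(1, p, p).unfold(2, p, p)    # (3, 4, 4, 8, 8): a 4x4 grid of patches
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3, p, p)   # (16, 3, 8, 8)
```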


b) Patch processing network

focus on improving ResNet-50

use it as the patch processing network \(P\)
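
A sketch of running the first 3 blocks of ResNet-50 as \(P\) on 32x32 patches; the spatial averaging into one vector per patch is an assumption about how the single feature vector is obtained:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

r = resnet50()
P = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool,   # stem
                  r.layer1, r.layer2, r.layer3)        # first 3 residual blocks

patches = torch.randn(49, 3, 32, 32)     # the 49 patches of a 224x224 image
fmap = P(patches)                        # (49, 1024, 2, 2) feature maps
h = fmap.mean(dim=(2, 3))                # spatial average -> one 1024-d vector per patch
```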


c) Efficient implementation of mask prediction

for efficiency… the decoder is implemented to predict the correct patches for multiple locations at the same time
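
A minimal sketch of this idea, assuming one query per masked location ( \(u\) + a location embedding ) scored against all held-out patch features at once with a cross-entropy loss; names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

d = 256
u = torch.randn(1, d)                    # pooled vector from the Encoder
loc_emb = torch.randn(3, d)              # embeddings of the 3 masked locations (learned in practice)
dec_feats = torch.randn(3, d)            # features of the 3 held-out patches, from P

queries = u + loc_emb                    # one query per masked location
logits = queries @ dec_feats.t()         # (3, 3): every location scored against every candidate patch
loss = F.cross_entropy(logits, torch.arange(3))   # location k should select patch k
```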


(2) Attention Pooling

attention pooling network : \(A\)


a) Transformer as pooling operation

notation

  • patch processing network : \(P\)

  • input vectors : \(\left\{h_1, h_2, \ldots, h_n\right\}\) ( outputs of \(P\), one per patch )

    \(\rightarrow\) pool them into a single vector \(u\)


attention pooling

  • \(u, h_1^{\text{output}}, h_2^{\text{output}}, \ldots, h_n^{\text{output}}=\operatorname{TransformerLayers}\left(u_0, h_1, h_2, \ldots, h_n\right)\)

    ( \(u_0\) : learnable vector / use only \(u\) as the pooling result! )
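
A minimal sketch of this pooling with a learnable \(u_0\) prepended to the inputs and standard PyTorch Transformer layers ( hyperparameters are assumptions ):

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Pool n patch vectors into a single vector u via a learnable query u_0."""
    def __init__(self, d_model=256, nhead=4, num_layers=3):
        super().__init__()
        self.u0 = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers)

    def forward(self, h):                                    # h: (batch, n, d_model)
        x = torch.cat([self.u0.expand(h.size(0), -1, -1), h], dim=1)
        out = self.layers(x)                                 # (batch, n + 1, d_model)
        return out[:, 0]                                     # keep only u as the pooling result

u = AttentionPooling()(torch.randn(2, 6, 256))               # -> (2, 256)
```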


b) Positional embedding

( image size 32x32 ) : 16 patches ( of size 8x8 )

( image size 224x224 ) : 49 patches ( of size 32x32 )

\(\rightarrow\) instead of learning 49 positional embeddings … decompose each position into a row embedding + a column embedding ( their sum ) \(\rightarrow\) only need to learn 7+7 (=14) embeddings
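
A small sketch of this factorization, assuming each of the 49 positions uses the sum of a learned row embedding and a learned column embedding:

```python
import torch
import torch.nn as nn

d = 256
row_emb = nn.Embedding(7, d)             # 7 learned row embeddings
col_emb = nn.Embedding(7, d)             # 7 learned column embeddings

pos = torch.arange(49)                   # flattened 7x7 grid positions
pos_emb = row_emb(pos // 7) + col_emb(pos % 7)   # (49, d) built from only 14 learned vectors
```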
