Selfie : Self-supervised Pretraining for Image Embedding
Contents
- Abstract
- Method
- Pretraining Details
- Attention Pooling
0. Abstract
introduces a pretraining technique called Selfie
( = SELF-supervised Image Embedding )
- generalizes the concept of masked language modeling from BERT to images
- learns to select the correct patch for each masked-out location ( among distractor patches from the same image )
1. Method
2 stages
- (1) pre-training
- (2) fine-tuning
(1) pre-training stage
- \(P\) : patch processing network
  - produces 1 feature vector per patch ( for both Encoder & Decoder )
- Encoder
  - feature vectors are pooled by the attention pooling network \(A\)
    \(\rightarrow\) produces a single vector \(u\)
- Decoder
  - no pooling
  - feature vectors are sent directly to the loss computation
- Encoder & Decoder : jointly trained ( see the sketch below )
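A minimal PyTorch-style sketch (not the authors' code) of one joint pre-training step; `P`, `A`, the positional embeddings of the masked locations, and `mask_prediction_loss` are assumed to be given, and `mask_prediction_loss` is sketched under "Efficient implementation of mask prediction" below.

```python
# Sketch of one joint pre-training step; P, A and mask_prediction_loss are
# hypothetical modules/functions standing in for the components described above.
def pretrain_step(P, A, mask_prediction_loss, enc_patches, dec_patches, pos_emb, optimizer):
    """enc_patches: (B, n_enc, C, H, W); dec_patches: (B, n_dec, C, H, W);
    pos_emb: (n_dec, d) positional embeddings of the masked-out locations."""
    B, n_enc = enc_patches.shape[:2]
    n_dec = dec_patches.shape[1]
    h_enc = P(enc_patches.flatten(0, 1)).view(B, n_enc, -1)   # 1 feature vector per patch
    h_dec = P(dec_patches.flatten(0, 1)).view(B, n_dec, -1)   # same P on the decoder side
    u = A(h_enc)                                    # encoder: attention pooling -> single vector u
    loss = mask_prediction_loss(u, h_dec, pos_emb)  # decoder features go straight into the loss
    optimizer.zero_grad()
    loss.backward()                                 # encoder & decoder parameters updated jointly
    optimizer.step()
    return loss.item()
```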
(2) fine-tuning stage
- goal : improve ResNet-50
  \(\rightarrow\) pretrain the first 3 blocks of this architecture ( = \(P\) ), then fine-tune the full ResNet-50 on the labeled task ( sketched below )
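A hedged sketch of assembling the fine-tuning model, assuming torchvision's ResNet-50 layer naming and that the pretrained \(P\) was saved as a compatible `nn.Sequential` state dict; `build_finetune_model` is an illustrative name, not from the paper.

```python
import torch.nn as nn
import torchvision

def build_finetune_model(pretrained_P_state, num_classes):
    resnet = torchvision.models.resnet50()
    blocks_1_to_3 = nn.Sequential(                 # this part corresponds to P (minus patch pooling)
        resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
        resnet.layer1, resnet.layer2, resnet.layer3,
    )
    # assumes the pretrained P was stored with compatible Sequential keys
    blocks_1_to_3.load_state_dict(pretrained_P_state, strict=False)
    head = nn.Sequential(                          # block 4 + classifier, trained during fine-tuning
        resnet.layer4, nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(2048, num_classes),
    )
    return nn.Sequential(blocks_1_to_3, head)      # fine-tune the whole network on labeled data
```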
(1) Pretraining Details
- use a part of the input image to predict the rest of the image
- ex) Patches 1, 2, 5, 6, 7, 9 : sent to Encoder
- ex) Patches 3, 4, 8 : sent to Decoder ( masked out )
a) Patch Sampling method
image size 32x32 \(\rightarrow\) patch size = 8x8
image size 224x224 \(\rightarrow\) patch size = 32x32
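A small sketch of splitting an image into non-overlapping patches and choosing which ones go to the encoder vs. the decoder; `sample_patches` and `n_masked` are illustrative names, and the mask here is shared across the batch for simplicity.

```python
import torch

def sample_patches(images, patch_size, n_masked):
    """images: (B, C, H, W) -> (encoder patches, decoder patches, masked positions)."""
    B, C, H, W = images.shape
    grid = H // patch_size                         # e.g. 32x32 image, 8x8 patches -> 4x4 grid
    patches = (images
               .unfold(2, patch_size, patch_size)
               .unfold(3, patch_size, patch_size)  # (B, C, grid, grid, p, p)
               .permute(0, 2, 3, 1, 4, 5)
               .reshape(B, grid * grid, C, patch_size, patch_size))
    perm = torch.randperm(grid * grid)             # one shared mask per batch, for simplicity
    masked_pos, kept_pos = perm[:n_masked], perm[n_masked:]
    return patches[:, kept_pos], patches[:, masked_pos], masked_pos

# usage: enc_patches, dec_patches, masked_pos = sample_patches(images, 32, n_masked=3)
```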
b) Patch processing network
focus on improving ResNet-50
use ( the first 3 blocks of ) it as the patch processing network \(P\)
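A sketch of one way to build \(P\) with torchvision, keeping the first 3 residual blocks; ending with global average pooling is an assumption consistent with the "1 feature vector per patch" note above.

```python
import torch.nn as nn
import torchvision

def build_patch_network():
    resnet = torchvision.models.resnet50()
    return nn.Sequential(
        resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
        resnet.layer1, resnet.layer2, resnet.layer3,   # "first 3 blocks" of ResNet-50
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),         # one 1024-dim vector per patch
    )
```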
c) Efficient implementation of mask prediction
for efficiency, the decoder is implemented to predict the correct patches for multiple masked-out locations at the same time ( one vectorized loss, sketched below )
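A sketch of what such a vectorized implementation could look like: every masked location is classified against all masked-out patches of the same image with one matrix of dot products; the query for a location is assumed to be \(u\) plus that location's positional embedding.

```python
import torch
import torch.nn.functional as F

def mask_prediction_loss(u, h_dec, pos_emb):
    """u: (B, d) pooled encoder vector; h_dec: (B, n_dec, d) features of the
    masked-out patches; pos_emb: (n_dec, d) embeddings of the masked locations."""
    queries = u.unsqueeze(1) + pos_emb                     # (B, n_dec, d): one query per location
    logits = torch.einsum("bmd,bnd->bmn", queries, h_dec)  # each query scored against every patch
    B, n_dec = logits.shape[:2]
    target = torch.arange(n_dec, device=logits.device).expand(B, n_dec)
    return F.cross_entropy(logits.reshape(-1, n_dec), target.reshape(-1))
```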
(2) Attention Pooling
attention pooling network : \(A\)
a) Transformer as pooling operation
notation
- patch processing network : \(P\)
- input vectors : \(\left\{h_1, h_2, \ldots, h_n\right\}\)
  \(\rightarrow\) pool them into a single vector \(u\)
attention pooling
- \(u, h_1^{\text{output}}, h_2^{\text{output}}, \ldots, h_n^{\text{output}} = \operatorname{TransformerLayers}\left(u_o, h_1, h_2, \ldots, h_n\right)\)
  ( use only \(u\) as the pooling result! )
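A hedged PyTorch sketch of the attention pooling network \(A\), reading \(u_o\) as a learnable vector prepended to the patch features (similar in spirit to BERT's CLS token); the layer and head counts are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, dim=1024, heads=8, layers=3):
        super().__init__()
        self.u0 = nn.Parameter(torch.zeros(1, 1, dim))        # learnable u_o
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, h):                                     # h: (B, n, dim) patch features
        x = torch.cat([self.u0.expand(h.size(0), -1, -1), h], dim=1)
        out = self.transformer(x)                             # (B, n + 1, dim)
        return out[:, 0]                                      # only u is used as the pooling result
```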
b) Positional embedding
( image size 32x32 ) : 16 patches ( of size 8x8 )
( image size 224x224 ) : 49 patches ( of size 32x32 )
\(\rightarrow\) decompose each positional embedding into a row embedding + a column embedding : for the 7x7 grid, instead of learning 49 positional embeddings, only need to learn 7+7 (=14) embeddings ( sketched below )
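A sketch of the factorized positional embedding: the embedding of a location is the sum of its row embedding and its column embedding, so a 7x7 grid needs 7 + 7 learned vectors instead of 49; the class name is illustrative.

```python
import torch.nn as nn

class FactorizedPositionEmbedding(nn.Module):
    def __init__(self, grid_size=7, dim=1024):
        super().__init__()
        self.row = nn.Embedding(grid_size, dim)    # 7 row embeddings
        self.col = nn.Embedding(grid_size, dim)    # 7 column embeddings
        self.grid_size = grid_size

    def forward(self, positions):                  # positions: (m,) flat indices in [0, grid_size**2)
        rows = positions // self.grid_size
        cols = positions % self.grid_size
        return self.row(rows) + self.col(cols)     # (m, dim) = row embedding + column embedding
```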