( Reference: FastCampus, "All-in-One Computer Vision Ultimate Package" course )

Vision Transformer (ViT)


1. Introduction

  • Transformer encoder, which consists of alternating layers of multi-head self-attention (MSA) and MLP blocks

  • linear embeddings of image patches

    ( the image is split into fixed-size patches, each patch is flattened and mapped by a trainable linear projection, and learnable 1D positional embeddings are added; see the sketch after this list )

  • works best when pre-trained on very large-scale datasets (e.g. ImageNet-21k, JFT-300M), since ViT has weaker image-specific inductive biases than CNNs

  • first token: a learnable [class] token prepended to the patch sequence

    • its final encoder state aggregates information from all patch tokens through self-attention
    • used as the input to the image classification head
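A minimal PyTorch sketch of the embedding step is shown below (illustrative only, not the torchvision implementation; the class name and the ViT-B/16 hyperparameters are assumptions):

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    # Split an image into patches, project each patch, prepend a [class] token,
    # and add learnable 1D positional embeddings (hyperparameters follow ViT-B/16).
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided conv is equivalent to flattening each patch and applying a linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)       # (B, 196, 768) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # (B, 197, 768) with [class] token prepended
        return x + self.pos_embed              # add positional embeddings

tokens = PatchEmbedding()(torch.rand(2, 3, 224, 224))   # (2, 197, 768)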



2. Variants

The standard variants are ViT-Base, ViT-Large, and ViT-Huge, which differ in depth, hidden size, MLP size, number of attention heads, and parameter count. The suffix in names such as ViT-B/16 or ViT-L/32 denotes the patch size (16×16 or 32×32 pixels); smaller patches yield longer token sequences, which generally improves accuracy at a higher compute cost.
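A quick way to see the size difference is to count parameters with torchvision (a small sketch; assumes torchvision >= 0.12, and the models are built without downloading pretrained weights):

import torchvision.models as models

# Compare parameter counts of two randomly initialized ViT variants (no download needed).
for name, builder in [("ViT-B/16", models.vit_b_16), ("ViT-L/16", models.vit_l_16)]:
    model = builder()
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")   # roughly 86M vs. 304M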


3. Code

(1) Transformer

import torch
import torch.nn as nn

transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)  # full encoder-decoder, d_model defaults to 512
src = torch.rand((10, 32, 512))    # (source length, batch, d_model)
tgt = torch.rand((20, 32, 512))    # (target length, batch, d_model)
out = transformer_model(src, tgt)  # (20, 32, 512)
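ViT uses only the encoder side of this architecture. For comparison, an encoder-only stack with ViT-Base-like hyperparameters can be built from the same torch.nn blocks (a sketch; the sizes are the ViT-Base values, and norm_first requires PyTorch >= 1.10):

import torch
import torch.nn as nn

# Encoder-only stack roughly matching ViT-Base: 12 layers, width 768, 12 heads, GELU MLP of size 3072.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072,
    activation="gelu", batch_first=True, norm_first=True,  # ViT applies LayerNorm before attention/MLP
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

tokens = torch.rand(2, 197, 768)   # (batch, 1 class token + 196 patch tokens, embed dim)
out = encoder(tokens)              # (2, 197, 768)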


(2) ViT

import torchvision.models as models

# Pretrained ViT models in torchvision (>= 0.12); the suffix is the patch size.
# Note: `pretrained=True` is deprecated in newer releases in favor of `weights=...` (see below).
vit_b_16 = models.vit_b_16(pretrained=True)   # ViT-Base,  16x16 patches
vit_b_32 = models.vit_b_32(pretrained=True)   # ViT-Base,  32x32 patches
vit_l_16 = models.vit_l_16(pretrained=True)   # ViT-Large, 16x16 patches
vit_l_32 = models.vit_l_32(pretrained=True)   # ViT-Large, 32x32 patches
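A usage sketch with the newer weights API (assumes torchvision >= 0.13; the random tensor is a stand-in for a real image):

import torch
import torchvision.models as models
from torchvision.models import ViT_B_16_Weights

weights = ViT_B_16_Weights.IMAGENET1K_V1
model = models.vit_b_16(weights=weights)
model.eval()

preprocess = weights.transforms()        # resize / center-crop / normalize expected by the checkpoint
img = torch.rand(3, 256, 256)            # stand-in for a real image tensor
batch = preprocess(img).unsqueeze(0)     # (1, 3, 224, 224)

with torch.no_grad():
    logits = model(batch)                # (1, 1000) ImageNet-1k class scores
    pred = logits.argmax(dim=1).item()
print(weights.meta["categories"][pred])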
