DINOv2: Learning Robust Visual Features without Supervision

Oquab, Maxime, et al. "Dinov2: Learning robust visual features without supervision." arXiv preprint arXiv:2304.07193 (2023).

References:

  • https://aipapersacademy.com/dinov2-from-meta-ai-finally-a-foundational-model-in-computer-vision/
  • https://arxiv.org/pdf/2304.07193


Contents

  1. Introduction
  2. How to use DINOv2
  3. DINOv2 Models Distillation
  4. SSL with Large Curated Data
  5. Pixel Level Understanding


1. Introduction

DINOv2

  • Computer vision model from Meta AI
  • Foundational model
    • Pretrained ViT model (1B params)

figure2


2. How to use DINOv2

Load the pretrained models through PyTorch Hub, using the code from the DINOv2 GitHub page.

import torch

# Each call downloads one pretrained backbone (patch size 14) from PyTorch Hub.
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')  # ViT-S/14
dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')  # ViT-B/14
dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')  # ViT-L/14
dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')  # ViT-g/14
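
Before feeding images to one of the models loaded above, it helps to know how the input size relates to the patch grid. The following is a minimal sketch in plain Python, not part of the DINOv2 codebase. The helper names `snap_to_patch_multiple` and `num_patch_tokens` are hypothetical, and the embedding widths in the table are the usual ViT-S/B/L/g sizes as I understand the released checkpoints, so treat them as assumptions to verify against the model card.

```python
# Hub entry-point names are taken from the loading code above.
# Embedding widths are assumed (standard ViT-S/B/L/g sizes); verify before relying on them.
DINOV2_VARIANTS = {
    "dinov2_vits14": 384,
    "dinov2_vitb14": 768,
    "dinov2_vitl14": 1024,
    "dinov2_vitg14": 1536,
}

# The "14" suffix in each model name is the patch size in pixels.
PATCH_SIZE = 14


def snap_to_patch_multiple(size, patch=PATCH_SIZE):
    """Round an image side length down to the nearest multiple of the patch size."""
    if size < patch:
        raise ValueError("side length is smaller than one patch")
    return (size // patch) * patch


def num_patch_tokens(height, width, patch=PATCH_SIZE):
    """Number of patch tokens for an image whose sides are multiples of the patch size."""
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    return (height // patch) * (width // patch)


side = snap_to_patch_multiple(500)    # 500 // 14 = 35, so 35 * 14 = 490
tokens = num_patch_tokens(224, 224)   # 16 x 16 grid of patches
```

With a 224 x 224 input, each model sees a 16 x 16 grid of patches, and every patch is mapped to a vector of the width listed for that variant.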


3. DINOv2 Models Distillation

figure2

Teacher-student distillation

  • (Teacher) Large pretrained DINOv2 model \(\rightarrow\) Freeze
  • (Student) Smaller model \(\rightarrow\) Train


Distillation process

  • Aims to minimize the difference between the teacher's and the student's embeddings of the same input
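
The idea of pulling a trainable student toward a frozen teacher can be sketched in a few lines of plain Python. This is a toy illustration under loud assumptions: the mean squared error, the elementwise "student" with one weight per dimension, and the names `mse`, `student_embed`, and `distill_step` are all stand-ins chosen for clarity, not the objective or architecture used in the paper.

```python
def mse(a, b):
    """Mean squared error between two equal-length embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def student_embed(weights, x):
    """Toy student: scales each input dimension by one trainable weight."""
    return [w * xi for w, xi in zip(weights, x)]


def distill_step(weights, x, teacher_emb, lr=0.1):
    """One gradient-descent step on the MSE between student and teacher embeddings."""
    emb = student_embed(weights, x)
    n = len(weights)
    # d/dw_i of (1/n) * sum_j (w_j * x_j - t_j)^2  =  (2/n) * (w_i * x_i - t_i) * x_i
    grads = [2.0 * (s - t) * xi / n for s, t, xi in zip(emb, teacher_emb, x)]
    return [w - lr * g for w, g in zip(weights, grads)]


x = [1.0, 2.0, 3.0]
teacher_emb = [2.0, -2.0, 1.5]   # frozen teacher output: never updated
weights = [0.0, 0.0, 0.0]        # student parameters: the only thing trained

initial_loss = mse(student_embed(weights, x), teacher_emb)
for _ in range(200):
    weights = distill_step(weights, x, teacher_emb)
final_loss = mse(student_embed(weights, x), teacher_emb)
```

Only `weights` changes during the loop, which mirrors the freeze/train split between teacher and student.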


Findings: Distillation gives better results than training the smaller models from scratch

  • (In practice) Use multiple students (use the average values)


4. SSL with Large Curated Data

figure2

(Model size) DINOv2 > DINO

\(\rightarrow\) Need for more training data to train DINOv2 using SSL!


How to increase data size?

(Previous works) Increase uncurated data size with SSL

\(\rightarrow\) Drop in quality

  • (DINOv2) Automated pipeline to create a curated dataset

    \(\rightarrow\) Key factor for reaching SOTA


# of Data

  • Starts from 25 data sources containing 1.2B images
  • Ends up with 142M curated images

figure2


Curation pipeline: Multiple filtering steps

  • (Original uncurated dataset) Many more cat images than non-cat images
    • Model is good at cats
    • Model is bad at other domains

\(\rightarrow\) Solution: clustering

  • Group images based on similarity

  • Sample a similar number of images from each group

    \(\rightarrow\) Makes it possible to create a smaller but more diverse dataset!
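
The rebalancing step can be sketched as follows. This is a minimal illustration, assuming that a cluster label per image is already available (for example from clustering image embeddings). The function name `balanced_sample` and the toy labels are hypothetical, and the actual DINOv2 pipeline involves more filtering steps than shown here.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(cluster_ids, per_cluster, seed=0):
    """Return sorted indices with at most `per_cluster` items drawn from each cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)

    kept = []
    for cid in sorted(by_cluster):
        members = by_cluster[cid]
        # Small clusters are kept whole; large clusters are subsampled.
        k = min(per_cluster, len(members))
        kept.extend(rng.sample(members, k))
    return sorted(kept)


# Toy uncurated set dominated by one concept: 90 "cat", 6 "dog", 4 "car".
cluster_ids = ["cat"] * 90 + ["dog"] * 6 + ["car"] * 4
kept = balanced_sample(cluster_ids, per_cluster=5)
kept_counts = Counter(cluster_ids[i] for i in kept)
```

In this toy set the 100 images shrink to 14, and "cat" drops from 90% of the data to roughly a third, which is the smaller-but-more-diverse effect described above.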


5. Pixel Level Understanding

Remarkable capability to grasp pixel level information!
