DINOv2: Learning Robust Visual Features without Supervision

Oquab, Maxime, et al. "Dinov2: Learning robust visual features without supervision." arXiv preprint arXiv:2304.07193 (2023).

References:

https://aipapersacademy.com/dinov2-from-meta-ai-finally-a-foundational-model-in-computer-vision/
https://arxiv.org/pdf/2304.07193

Introduction
How to use DINO v2
DINO v2 Models Distillation
SSL with Large Curated Data
Pixel Level Understanding

1. Introduction

DINOv2

Computer vision model from Meta AI
Foundational model
- Pretrained ViT model (1B params)

2. How to use DINO v2

Load the pretrained models with the PyTorch code from the DINOv2 GitHub page.

import torch

# Each call fetches the pretrained weights through torch.hub on first use.
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')  # ViT-S/14 (small)
dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')  # ViT-B/14 (base)
dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')  # ViT-L/14 (large)
dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')  # ViT-g/14 (giant)
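
Once a model is loaded, it can be used as a frozen feature extractor. The sketch below is one possible way to do that, not code from the DINOv2 repository: the helper names `crop_to_patch_multiple` and `embed` are made up for illustration, and it assumes the input is a float tensor of shape (B, 3, H, W) that has already been normalized. The crop is there because the `*14` models split the image into 14x14 patches, so height and width should be multiples of 14.

```python
import torch

PATCH = 14  # patch size of the ViT-*/14 checkpoints


def crop_to_patch_multiple(images: torch.Tensor, patch: int = PATCH) -> torch.Tensor:
    """Center-crop (B, 3, H, W) so that H and W are multiples of the patch size."""
    h, w = images.shape[-2:]
    new_h, new_w = h - h % patch, w - w % patch
    top, left = (h - new_h) // 2, (w - new_w) // 2
    return images[..., top:top + new_h, left:left + new_w]


@torch.no_grad()  # inference only: no gradients are tracked
def embed(model: torch.nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Return one L2-normalized embedding per image (hypothetical helper)."""
    model.eval()
    feats = model(crop_to_patch_multiple(images))
    return torch.nn.functional.normalize(feats, dim=-1)
```

For example, `embed(dinov2_vits14, batch)` should return a tensor of shape (B, 384), since ViT-S/14 uses a 384-dimensional embedding. Normalized embeddings can then be compared with a dot product or fed to a linear classifier.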

3. DINO v2 Models Distillation

Teacher-student distillation

(Teacher) Large pretrained DINOv2 model \(\rightarrow\) Freeze
(Student) Smaller model \(\rightarrow\) Train

Distillation process

Aims to minimize the difference between the embeddings
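
The idea of a frozen teacher and a trained student can be sketched as a single update step. This is a simplified illustration, not the paper's training code: the loss here is a plain cosine distance between embeddings, chosen as a stand-in, while the objective used in the paper is more involved. The tiny `teacher` and `student` networks and the function name `distill_step` are hypothetical placeholders for the real ViTs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def distill_step(teacher, student, optimizer, images):
    """One update that pulls the student's embeddings toward the teacher's."""
    with torch.no_grad():  # teacher is frozen: no gradients flow into it
        target = teacher(images)
    pred = student(images)
    # Stand-in loss (assumption): 1 - cosine similarity between embeddings.
    loss = (1 - F.cosine_similarity(pred, target, dim=-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy stand-ins for the large and small models (hypothetical, not real ViTs).
torch.manual_seed(0)
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 64), nn.ReLU(), nn.Linear(64, 16))
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 16))
teacher.requires_grad_(False).eval()  # freeze the teacher

optimizer = torch.optim.SGD(student.parameters(), lr=0.1)
images = torch.randn(32, 3, 8, 8)  # random data in place of real images
losses = [distill_step(teacher, student, optimizer, images) for _ in range(50)]
```

Only the student's parameters are passed to the optimizer, so the teacher stays fixed throughout and acts purely as a source of target embeddings.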

Findings: Better results with distillation (compared with training smaller models from scratch)

(In practice) Use multiple students (use the average values)

4. SSL with Large Curated Data

(Model size) DINOv2 > DINO

\(\rightarrow\) Need for more training data to train DINOv2 using SSL!

How to increase data size?

(Previous works) Increase uncurated data size with SSL

\(\rightarrow\) Drop in quality

(DINOv2) Automated pipeline to create a curated dataset

\(\rightarrow\) Key factor for reaching SOTA

Dataset size

Starts from 25 data sources containing 1.2B images
Ends up with 142M curated images

Curation pipeline: Multiple filtering steps

(Original uncurated dataset) Far more cat images than non-cat images
- A model trained on it does well on cats
- But does poorly on other domains

\(\rightarrow\) Solution: clustering

Grouping images based on similarities
Sample from each group a similar number of images

\(\rightarrow\) Makes it possible to build a smaller but more diverse dataset!
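
The two steps above (group by similarity, then sample a similar number from each group) can be sketched on synthetic embeddings. This is only a toy illustration of the rebalancing idea, not the paper's curation pipeline: it uses a small hand-written k-means, the function names `kmeans` and `balanced_sample` are hypothetical, and the two Gaussian blobs stand in for "cat" and "non-cat" image embeddings.

```python
import numpy as np


def kmeans(x, k, iters=20):
    """Plain Lloyd's k-means with farthest-point initialization."""
    centers = [x[0]]
    for _ in range(k - 1):
        # Distance of every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(x - c, axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(d)])
    centers = np.stack(centers).astype(float)

    def assign(c):
        dists = np.linalg.norm(x[:, None, :] - c[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(iters):
        labels = assign(centers)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = x[labels == j].mean(axis=0)
    return assign(centers)


def balanced_sample(labels, per_cluster, seed=0):
    """Pick at most `per_cluster` indices from every cluster."""
    rng = np.random.default_rng(seed)
    picked = []
    for j in np.unique(labels):
        members = np.flatnonzero(labels == j)
        take = min(per_cluster, len(members))
        picked.append(rng.choice(members, size=take, replace=False))
    return np.concatenate(picked)


# Synthetic, imbalanced "embeddings": 900 from one concept, 100 from another.
rng = np.random.default_rng(0)
cats = rng.normal(0.0, 0.5, size=(900, 8))
others = rng.normal(10.0, 0.5, size=(100, 8))
embeddings = np.concatenate([cats, others])

labels = kmeans(embeddings, k=2)
idx = balanced_sample(labels, per_cluster=50)
```

The selected subset `idx` is much smaller than the original pool, but both concepts contribute the same number of samples instead of a 9:1 ratio.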

5. Pixel Level Understanding

Remarkable capability to grasp pixel level information!
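
One common way to inspect this, and the kind of visualization shown in the paper, is to run PCA on the per-patch features and view the leading components as an image. The sketch below assumes the patch tokens have already been extracted as an (N, D) array for one image; in the DINOv2 repository these appear to be available from the model's `forward_features` output, but treat the exact key name as something to verify. The function name `patch_pca_map` is hypothetical, and random numbers stand in for real tokens.

```python
import numpy as np


def patch_pca_map(tokens, grid_h, grid_w, n_components=3):
    """Project (N, D) patch tokens onto their top principal components and
    reshape to a (grid_h, grid_w, n_components) map scaled to [0, 1]."""
    assert tokens.shape[0] == grid_h * grid_w
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:n_components].T
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    proj = (proj - lo) / np.maximum(hi - lo, 1e-12)  # per-component rescale
    return proj.reshape(grid_h, grid_w, n_components)


# A 224x224 image with 14x14 patches gives a 16x16 grid of patch tokens.
rng = np.random.default_rng(0)
fake_tokens = rng.normal(size=(16 * 16, 384))  # placeholder for real features
pca_map = patch_pca_map(fake_tokens, grid_h=16, grid_w=16)
```

With real features, the three channels of `pca_map` can be displayed as an RGB image; patches belonging to the same object part tend to receive similar colors.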


Seunghan Lee