DINOv2: Learning Robust Visual Features without Supervision
Oquab, Maxime, et al. "DINOv2: Learning robust visual features without supervision." arXiv preprint arXiv:2304.07193 (2023).
- https://aipapersacademy.com/dinov2-from-meta-ai-finally-a-foundational-model-in-computer-vision/
- https://arxiv.org/pdf/2304.07193
- Introduction
- How to use DINO v2
- DINO v2 Models Distillation
- SSL with Large Curated Data
- Pixel Level Understanding
1. Introduction
- Computer vision model from Meta AI
- Foundational model
- Pretrained ViT model (1B params)
2. How to use DINO v2
Load it with the PyTorch Hub code from the DINOv2 GitHub page.
import torch
# Each call fetches the pretrained weights on first use.
# s / b / l / g = small / base / large / giant ViT; 14 = patch size.
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')
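The 14 in each model name is the ViT patch size, so input images are usually cropped or resized so that height and width are multiples of 14. A minimal sketch of that arithmetic follows. `PATCH_SIZE` and `fit_to_patches` are hypothetical helper names for illustration, not part of the DINOv2 repository.

```python
PATCH_SIZE = 14  # assumed from the "14" suffix in names such as dinov2_vits14


def fit_to_patches(height, width, patch_size=PATCH_SIZE):
    """Round an image size down to the nearest multiple of the patch size.

    Returns the cropped (height, width) and the number of patch tokens.
    """
    if height < patch_size or width < patch_size:
        raise ValueError("image is smaller than a single patch")
    rows = height // patch_size
    cols = width // patch_size
    return (rows * patch_size, cols * patch_size), rows * cols


size, num_patches = fit_to_patches(224, 224)
print(size, num_patches)  # (224, 224) 256
```

A 224 x 224 input therefore yields a 16 x 16 grid of patches, and a 500 x 375 image would be cropped to 490 x 364 before being passed to the model.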
3. DINO v2 Models Distillation
Teacher-student distillation
- (Teacher) Large pretrained DINOv2 model \(\rightarrow\) Freeze
- (Student) Smaller model \(\rightarrow\) Train
Distillation process
- Aims to minimize the difference between the teacher's and the student's embeddings
Findings: Better results with distillation (compared to training smaller models from scratch)
- (In practice) Train multiple students and average them to obtain the final distilled model
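The teacher-student setup above can be sketched with a toy example. This is only an illustration under strong assumptions: both "models" are tiny linear maps, the loss is a plain mean squared error between embeddings, and the names `embed` and `distill_step` are hypothetical. The actual DINOv2 training objective is more involved.

```python
import random


def embed(weights, x):
    """Linear stand-in for a network: returns weights @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def distill_step(student, teacher, x, lr=0.1):
    """One gradient step on the mean squared error between embeddings.

    Only the student is updated; the teacher stays frozen.
    Returns the loss measured before the update.
    """
    target = embed(teacher, x)
    pred = embed(student, x)
    diff = [p - t for p, t in zip(pred, target)]
    loss = sum(d * d for d in diff) / len(diff)
    for i, row in enumerate(student):
        for j in range(len(row)):
            # d(loss)/d(student[i][j]) = 2 * diff[i] * x[j] / len(diff)
            row[j] -= lr * 2 * diff[i] * x[j] / len(diff)
    return loss


rng = random.Random(0)
teacher = [[0.5, -1.0], [2.0, 0.3]]  # frozen "large" model (toy values)
student = [[0.0, 0.0], [0.0, 0.0]]   # trainable "small" model

losses = []
for _ in range(300):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    losses.append(distill_step(student, teacher, x))
```

After the loop, the student's embeddings should closely match the teacher's on new inputs, which is the sense in which distillation "minimizes the difference between the embeddings".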
4. SSL with Large Curated Data
(Model size) DINOv2 > DINO
\(\rightarrow\) More training data is needed to train DINOv2 with SSL!
How to increase data size?
(Previous work) Scale up uncurated data for SSL
\(\rightarrow\) Drop in feature quality
(DINOv2) Automated pipeline to create a curated dataset
\(\rightarrow\) Key factor for reaching SOTA
Dataset size
- Starts from 25 data sources containing 1.2B images
- Results in 142M curated images.
Curation pipeline: Multiple filtering steps
- (Original uncurated dataset) Many more cat images than non-cat images
- Good on cats
- Bad on other domains
\(\rightarrow\) Solution: clustering
Group images by similarity
Sample a similar number of images from each group
\(\rightarrow\) Makes it possible to create a smaller but more diverse dataset!
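The balancing idea can be sketched as follows, assuming every image already has a cluster id (for example from k-means on image embeddings, which is not shown here). `balanced_sample` is a hypothetical helper, not the pipeline from the paper.

```python
import random
from collections import defaultdict


def balanced_sample(cluster_ids, per_cluster, seed=0):
    """Pick at most `per_cluster` item indices from every cluster.

    `cluster_ids[i]` is the cluster assigned to image i.
    """
    groups = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        groups[cluster].append(index)

    rng = random.Random(seed)
    selected = []
    for cluster in sorted(groups):
        members = groups[cluster]
        k = min(per_cluster, len(members))
        selected.extend(rng.sample(members, k))
    return sorted(selected)


# Toy data: 8 "cat" images (cluster 0), 2 "dog" images (1), 2 "car" images (2).
cluster_ids = [0] * 8 + [1] * 2 + [2] * 2
kept = balanced_sample(cluster_ids, per_cluster=2)
print(len(kept))  # 6
```

The 12 toy images shrink to 6, but cats no longer dominate: each cluster contributes the same number of samples.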
5. Pixel Level Understanding
Remarkable capability to grasp pixel-level information (e.g., semantic segmentation, depth estimation)!