Vision Transformers Need Registers – Fixing a Bug in DINOv2?

Darcet, Timothée, et al. "Vision transformers need registers." ICLR 2024

References:

  • https://aipapersacademy.com/vision-transformers-need-registers/
  • https://arxiv.org/pdf/2309.16588


Contents

  1. Background: Visual Features
  2. The Problem: Attention Map Artifacts
  3. The Fix: Registers
  4. Results
  5. Conclusion


Abstract

Proposal: add register tokens to Vision Transformers (ViTs)

  • Shares authors with the DINOv2 paper


1. Background: Visual Features

(figure)

  • Training models from scratch (X)

  • Using a pre-trained large computer vision model (O)

    • e.g., DINOv2 ( a large Vision Transformer (ViT) model )

    • Output = visual features or embeddings

      \(\rightarrow\) Capture the semantics of the input image ( see the sketch below )
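To make the "pre-trained feature extractor" idea concrete, here is a minimal sketch of pulling global DINOv2 embeddings through `torch.hub`. The entry-point name (`dinov2_vits14`), the preprocessing, and the 384-dim output are assumptions based on the public DINOv2 repository, not details stated in these notes.

```python
import torch
from PIL import Image
from torchvision import transforms

# Load a pre-trained DINOv2 backbone from the public repo (entry-point name assumed).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# Standard ImageNet-style preprocessing; DINOv2 uses a patch size of 14,
# so 224x224 input gives a 16x16 grid of patches.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # [1, 3, 224, 224]

with torch.no_grad():
    # Calling the model returns one global embedding per image
    # (the [CLS]-token feature), e.g. 384-dim for ViT-S/14.
    global_feature = model(img)

print(global_feature.shape)  # e.g. torch.Size([1, 384])
```

These frozen embeddings are what downstream tasks (classification, segmentation, depth, object discovery) consume without retraining the backbone.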


2. The Problem: Attention Map Artifacts

(1) Attention Map

ViT = built on the attention mechanism

  • Attention map = visualization of the attention values

    \(\rightarrow\) Shows which parts of the image the model considers important! ( a toy computation sketch follows below )
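The quantity that gets visualized is just the softmax-normalized attention weight of the [CLS] token over the patch tokens. A toy, self-contained sketch (single head, random tensors; not the DINOv2 code):

```python
import torch

def cls_attention_map(q, k, grid_size):
    """Compute the [CLS]-token attention over image patches for one head.

    q, k: [num_tokens, dim] query/key projections of one attention head,
          where token 0 is the [CLS] token and the rest are patch tokens.
    grid_size: (h, w) patch grid, so num_tokens = 1 + h * w.
    """
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale   # [num_tokens, num_tokens]
    attn = attn.softmax(dim=-1)
    cls_to_patches = attn[0, 1:]               # how much [CLS] attends to each patch
    return cls_to_patches.reshape(grid_size)   # [h, w] map, ready to overlay on the image

# Toy example: 16x16 patch grid (e.g. a 224x224 image with 14x14 patches)
h = w = 16
q = torch.randn(1 + h * w, 64)
k = torch.randn(1 + h * w, 64)
print(cls_attention_map(q, k, (h, w)).shape)   # torch.Size([16, 16])
```

Reshaping the [CLS]-to-patch weights onto the patch grid gives the heatmap that is overlaid on the image in the figures below.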


(2) Object Discovery

Object discovery: one use case for attention maps

  • Object detection = locates only known, labeled objects
  • Object discovery = also locates unknown, unlabeled objects
    • e.g., LOST = an object discovery method that uses attention maps (built on DINOv1)

(figure)


(3) Artifacts

When applying the LOST method with DINOv2 features,

\(\rightarrow\) the attention maps of DINOv2 turn out to be not as semantically clean as those of DINOv1 !!

  • Ex) Outlier peaks in the DINOv2 attention map = artifacts

(figure)


Not only DINOv2, but other large vision transformer models show these artifacts as well!

(e.g., OpenCLIP, DeiT)

(figure)


(4) Analyzing the Artifacts

Compare the L2 norm values of the features extracted for the image patches:

  • DINOv1: norms stay uniformly low (no outliers)
  • DINOv2:
    • The majority of patch features have a low norm
    • But a small proportion of patches have a very high norm! ( a norm-computation sketch follows after the figure )

(figure)
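A hedged sketch of this analysis: extract the per-patch tokens, compute their L2 norms, and flag the few outliers. `forward_features` returning a dict with an `x_norm_patchtokens` entry is an assumption based on the public DINOv2 code, and the cutoff is an illustrative choice rather than the paper's exact threshold.

```python
import torch

# `model` and `img` are the DINOv2 backbone and preprocessed image from the earlier sketch.
# forward_features() returning a dict with "x_norm_patchtokens" is an assumption
# based on the public DINOv2 implementation.
with torch.no_grad():
    feats = model.forward_features(img)
    patch_tokens = feats["x_norm_patchtokens"]   # [1, num_patches, dim]

norms = patch_tokens.norm(dim=-1).squeeze(0)     # L2 norm per patch, [num_patches]

# Pick a cutoff from the norm distribution; the paper observes that a small
# proportion of patches have norms far above the rest (the "artifacts").
cutoff = norms.mean() + 3 * norms.std()          # illustrative choice, not from the paper
outliers = norms > cutoff

print(f"high-norm patches: {outliers.sum().item()} / {norms.numel()}")
```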


(5) What Data Do The Artifacts Capture?

Conclusion

  1. Artifacts lack spatial information
  2. Artifacts hold global information


a) Artifacts lack spatial information

High-norm features

  • Contain less information about their position in the original image than normal patch tokens

  • (Left chart, orange line)

    • Artifacts are located in patches that are very similar to their surrounding patches

      ( = confirms that the artifacts appear in redundant background regions )

  • (Right charts) Linear models trained on the frozen features to ...

    • Task 1) predict the original position of a token

    • Task 2) reconstruct the original pixels

      \(\rightarrow\) In both tasks, performance is worse for the high-norm tokens! ( see the probe sketch after the figure )

(figure)
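To make Task 1 concrete, here is a minimal sketch of a linear probe that predicts a patch token's grid position from its embedding; running it separately on normal vs. high-norm tokens and comparing accuracy mirrors the paper's comparison. The dimensions and training loop are illustrative assumptions.

```python
import torch
import torch.nn as nn

dim, num_patches = 384, 256                 # e.g. ViT-S features on a 16x16 patch grid

# Linear probe: embedding -> which of the num_patches positions it came from.
probe = nn.Linear(dim, num_patches)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(patch_tokens):
    """patch_tokens: [batch, num_patches, dim] frozen features from the backbone."""
    b = patch_tokens.shape[0]
    targets = torch.arange(num_patches).repeat(b)      # true position of every token
    logits = probe(patch_tokens.reshape(-1, dim))      # [b * num_patches, num_patches]
    loss = loss_fn(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy check with random "features": in the real experiment, high-norm tokens
# give noticeably lower position accuracy than normal tokens.
print(train_step(torch.randn(8, num_patches, dim)))
```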


b) Artifacts hold global information

Classification results when training a simple classifier on different DINOv2 embeddings

  • (Row 1) class token ( [CLS] )
  • (Row 2) normal patch tokens
  • (Row 3) high-norm (outlier) tokens
  • (Row 3) > (Row 2) \(\rightarrow\) the outlier tokens classify images better than normal patch tokens, i.e., they carry global information ( a minimal probe sketch follows below )

(figure)
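The comparison boils down to training the same simple classifier on different kinds of frozen embeddings and comparing accuracy. A minimal sketch with scikit-learn's logistic regression; the random arrays are placeholders standing in for real DINOv2 features and labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def probe_accuracy(features, labels, test_features, test_labels):
    """Train a linear classifier on frozen embeddings and report test accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features, labels)
    return accuracy_score(test_labels, clf.predict(test_features))

# Placeholders: in the real experiment each row would be one kind of DINOv2 embedding
# per image ([CLS] token, a normal patch token, or a high-norm outlier token).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 384)), rng.integers(0, 10, 1000)
X_test, y_test = rng.normal(size=(200, 384)), rng.integers(0, 10, 200)

print(probe_accuracy(X_train, y_train, X_test, y_test))
```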


(6) When do the High-Norm Tokens Appear?

(figure: Figure 4 from the paper)


Figure 4(a)

  • More common from the middle layers to the last layers

Figure 4(b)

  • Start to appear only after training the model for a while

    ( not at the beginning of the training process )

Figure 4(c)

  • Only appear in larger models


Conclusion: large and sufficiently trained models learn to recognize redundant tokens, and use them to store global information.


3. The Fix: Registers

(1) Key Idea

Key idea) If the model learns to repurpose less important tokens in order to store global information…

\(\rightarrow\) We can add extra, dedicated tokens that the model can use to store that information

( instead of hijacking tokens that belong to the original image )


(2) Registers

(figure)


Solution: Registers ( = the added learnable tokens )

  • (1) Added to the input sequence
  • (2) Discarded from the output
    • Assumption: the model will use them instead of the image patch tokens to store the global information
      ( see the sketch after this list )
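A minimal sketch of the mechanism, assuming a generic ViT encoder: a few extra learnable tokens are concatenated to the [CLS] + patch token sequence before the transformer blocks and simply sliced off at the output. The module layout, dimensions, and the exact placement of the registers in the sequence are illustrative, not the exact DINOv2 implementation.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    def __init__(self, dim=384, depth=12, num_heads=6, num_registers=4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Registers: extra learnable tokens with no corresponding image patch.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        self.num_registers = num_registers
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, patch_tokens):
        """patch_tokens: [batch, num_patches, dim], already patch-embedded + position-encoded."""
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        reg = self.registers.expand(b, -1, -1)
        # (1) Registers are added to the input sequence ...
        x = torch.cat([cls, patch_tokens, reg], dim=1)
        x = self.blocks(x)
        # (2) ... and discarded from the output: keep only [CLS] + patch tokens.
        cls_out = x[:, 0]
        patches_out = x[:, 1:-self.num_registers]
        return cls_out, patches_out

model = ViTWithRegisters()
cls_out, patches_out = model(torch.randn(2, 256, 384))
print(cls_out.shape, patches_out.shape)  # torch.Size([2, 384]) torch.Size([2, 256, 384])
```

Because the registers are learnable parameters with no image patch attached to them, the model is free to dump global information into them, which is exactly what the artifact tokens were being used for.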


(3) Do Registers Prevent Artifacts?

(figures: feature norms and attention maps with vs. without registers; the artifacts are nearly eliminated when registers are added)


4. Results

(figures)


5. Conclusion

  • Highlights how unimportant tokens store useful information.
  • Registers nearly eliminate these artifacts.
  • (Experiment 1) Classification, segmentation, and depth
    • Registers yield minor gains but increase memory and latency, making their use case-dependent.
  • (Experiment 2) Object discovery
    • Improves significantly with DINOv2 but remains inferior to DINOv1.
