All about “SAM”

(Reference: https://www.youtube.com/watch?v=eYhvJR4zFUM)


Contents

  1. Image Segmentation
  2. Introduction to Segment Anything (SAM)
    1. Promptable Segmentation Task
    2. Segment Anything Model
    3. Segment Anything Dataset
  3. [A] Promptable Segmentation Task
  4. [B] Segment Anything Model
    1. Image Encoder
    2. Prompt Encoder
    3. Mask Decoder


1. Image Segmentation

Definition

Process of partitioning a digital image into multiple regions (or segments)

  • Pixels belonging to the same region share some (semantic) characteristics


Challenges

  1. Difficult & Expensive to label
  2. Models are usually application-specific
    • e.g., Medical imaging \(\rightarrow\) Pedestrian detection
  3. Previous models are usually not promptable
    • e.g., can’t tell the model to only segment “people”


2. Introduction to Segment Anything (SAM)

Three innovations

  1. Promptable Segmentation Task

  2. Segment Anything Model
  3. Segment Anything Dataset (and its Segment Anything Engine)


(1) Promptable Segmentation Task

Allows finding masks given a prompt of … (see the sketch after this list)

  • (1) Points (e.g., mouse click)
  • (2) Boxes (e.g., rectangle defined by user)
  • (3) Text prompts (e.g., “find all dogs”)
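
For concreteness, here is a hedged sketch of how these three prompt types might look as raw inputs. The array layouts follow the conventions of the public `segment_anything` repository, which is an assumption, and the text prompt is explored in the paper but not exposed by the released model.

```python
import numpy as np

# Hypothetical prompt payloads for a SAM-style model; the array layouts follow
# the conventions of the public `segment_anything` repository (an assumption).

# (1) Point prompt: pixel coordinates plus a label per point
#     (1 = foreground click, 0 = background click)
point_coords = np.array([[320, 240]])   # one click at (x=320, y=240)
point_labels = np.array([1])            # "this pixel is on the object"

# (2) Box prompt: a user-drawn rectangle in XYXY pixel coordinates
box = np.array([100, 80, 400, 300])     # [x_min, y_min, x_max, y_max]

# (3) Text prompt: a free-form description (explored in the paper, but not
#     exposed by the publicly released checkpoints)
text = "find all dogs"
```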


(2) Segment Anything Model

  • (1) Fast encoder-decoder model
  • (2) Ambiguity-aware
    • e.g., Given a point … it may correspond to (a), (b), or (c) (usage sketch after this list)
      • (a) Part
      • (b) Subpart
      • (c) Whole
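
A usage sketch of this ambiguity awareness, assuming the public `segment_anything` package and a locally downloaded ViT-H checkpoint (the file name is a placeholder): with `multimask_output=True`, a single click yields three candidate masks plus a predicted IoU score for each.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Assumes the official `segment_anything` package and a locally downloaded
# checkpoint; the file name below is a placeholder.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for a real RGB image
predictor.set_image(image)

# One ambiguous click: with multimask_output=True the model returns three
# candidate masks (roughly subpart / part / whole) plus a predicted IoU each.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print(masks.shape)   # (3, H, W) boolean masks
print(scores)        # predicted IoU (confidence) per candidate mask
```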


(3) Segment Anything Dataset

  • (1) 1.1 billion segmentation masks
    • Collected with the Segment Anything Engine
  • (2) No human supervision
    • All the masks have been generated automatically!


3. [A] Promptable Segmentation Task

Pretraining task of a foundation model …

  • for NLP: Next token prediction
  • for CV: Promptable Segmentation Task


Goal: Return a “valid” segmentation mask given any prompt

What is a “valid” mask?

\(\rightarrow\) Even when the prompt is ambiguous, the output should be a reasonable mask for at least one of the objects the prompt could refer to!



4. [B] Segment Anything Model

(1) Image Encoder


  • MAE pretrained ViT
  • Applied prior to prompting the model!
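
A minimal sketch of “applied prior to prompting”, assuming the public `segment_anything` package (the `sam.image_encoder` attribute follows its released code): the heavy ViT runs once per image, and the resulting embedding is reused for every prompt.

```python
import torch
from segment_anything import sam_model_registry

# Assumes the official `segment_anything` package and a local checkpoint
# (placeholder file name); module names follow its released code.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")

# SAM's encoder expects a preprocessed 1024x1024 image; a dummy tensor stands in.
pixels = torch.zeros(1, 3, 1024, 1024)

# The expensive step: the MAE pre-trained ViT runs ONCE per image ...
with torch.no_grad():
    image_embedding = sam.image_encoder(pixels)   # (1, 256, 64, 64)

# ... and this embedding is reused for every prompt, so interactive clicks
# only pay for the lightweight prompt encoder + mask decoder.
```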


(2) Prompt Encoder


a) Two sets of prompts

(1) Sparse (points, boxes, text)

  • 1-a) Points & Boxes: Represented by (1) + (2)
    • (1) Positional encodings (PE)
    • (2) Learned embeddings for each prompt type
  • 1-b) Text: Represented with the text encoder from CLIP

(2) Dense (masks)

  • Embedded using convolutions
  • Summed element-wise with image embedding


b) Details of Sparse prompts

Mapped to 256-dim

  • Point: (1) + (2)
    • (1) PE of the point’s location
    • (2) One of two learned embeddings
      • indicating whether the point is in the foreground or the background
  • Box: Embedding pair ((1), (2))
    • (1) PE of the “top-left corner” + Learned embedding for “top-left corner”
    • (2) PE of the “bottom-right corner” + Learned embedding for “bottom-right corner”
  • Text: embedding vector from text encoder of CLIP
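
A hedged sketch of the arithmetic above, using hypothetical names (`pe`, `point_type_embed`, `corner_embed`) that do not match SAM's real layer names: each sparse prompt becomes one or two 256-dim tokens, each the sum of a positional encoding and a learned type embedding.

```python
import torch
import torch.nn as nn

dim = 256
# Hypothetical stand-ins for SAM's learned prompt-type embeddings.
point_type_embed = nn.Embedding(2, dim)   # index 0: background click, 1: foreground click
corner_embed = nn.Embedding(2, dim)       # index 0: top-left corner, 1: bottom-right corner

def pe(xy: torch.Tensor) -> torch.Tensor:
    # Placeholder for the positional encoding of an (x, y) location;
    # the Fourier-feature version is sketched under "d) Positional Encodings" below.
    return torch.zeros(dim)

# Point prompt -> ONE token: PE of the click + foreground/background embedding
click = torch.tensor([320.0, 240.0])
point_token = pe(click) + point_type_embed(torch.tensor(1))          # (256,)

# Box prompt -> a PAIR of tokens, one per corner: PE + corner-type embedding
top_left, bottom_right = torch.tensor([100.0, 80.0]), torch.tensor([400.0, 300.0])
box_tokens = torch.stack([
    pe(top_left) + corner_embed(torch.tensor(0)),
    pe(bottom_right) + corner_embed(torch.tensor(1)),
])                                                                    # (2, 256)

# Text prompt -> the (projected) output of CLIP's text encoder, also 256-dim.
```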




c) Details of Dense prompts


Dense prompts

  • Have a spatial correspondence with the image


Downscale the input masks

  • Step 1) Input mask: provided at “4x lower resolution” than the input image

  • Step 2) Downscale an additional “4x” with two 2x2, stride-2 convolutions

    \(\rightarrow\) Output channels of 4 and 16, respectively

  • Step 3) Into 256 channels (with a 1x1 convolution)
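
A minimal runnable sketch of this downscaling stack: the 2x2, stride-2 convolutions and the 4 \(\rightarrow\) 16 \(\rightarrow\) 256 channel progression follow the paper's appendix, while the normalization/activation between convolutions is simplified here.

```python
import torch
import torch.nn as nn

# Sketch of the dense (mask) prompt embedding path described above.
# For a 1024x1024 input image, the mask prompt enters at 256x256 (4x lower),
# and the two stride-2 convs bring it to 64x64, matching the image embedding.
mask_downscaling = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=2, stride=2),    # 256x256 -> 128x128, 4 channels
    nn.GELU(),                                   # (released code also normalizes here)
    nn.Conv2d(4, 16, kernel_size=2, stride=2),   # 128x128 -> 64x64, 16 channels
    nn.GELU(),
    nn.Conv2d(16, 256, kernel_size=1),           # 1x1 conv -> 256 channels
)

mask_prompt = torch.zeros(1, 1, 256, 256)        # dummy low-res input mask
dense_embedding = mask_downscaling(mask_prompt)  # (1, 256, 64, 64)

# This is then summed element-wise with the image embedding (same shape).
```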



d) Positional Encodings

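The paper uses Fourier-feature positional encodings (following Tancik et al.): coordinates are normalized, projected by a fixed random Gaussian matrix, and passed through sine and cosine. A minimal sketch, where the matrix shape and normalization constant are assumptions:

```python
import torch

torch.manual_seed(0)
num_feats = 128                          # 2 * 128 = 256-dim encoding
gaussian = torch.randn(2, num_feats)     # fixed random spatial frequencies

def positional_encoding(coords_xy: torch.Tensor, image_size: int = 1024) -> torch.Tensor:
    """Map (..., 2) pixel coordinates to (..., 256) Fourier-feature encodings."""
    normed = coords_xy / image_size                  # scale to [0, 1]
    proj = 2 * torch.pi * normed @ gaussian          # (..., num_feats)
    return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

# Example: encode the 64x64 grid of the image embedding and a single click.
ys, xs = torch.meshgrid(torch.arange(64), torch.arange(64), indexing="ij")
grid_xy = torch.stack([xs, ys], dim=-1).float() * (1024 / 64)   # approx. pixel locations
grid_pe = positional_encoding(grid_xy)                          # (64, 64, 256)
click_pe = positional_encoding(torch.tensor([320.0, 240.0]))    # (256,)
```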


(3) Mask Decoder



Lightweight mask decoder

  • [Input] Image embedding & Prompt embeddings
    • Image embedding: \(256 \times 64 \times 64\)
    • Prompt embeddings (with output tokens): \(N_{\text{tokens}} \times 256\)
      • Output tokens: learned tokens prepended to the prompts, analogous to the [CLS] token
  • [Output] Output masks & IoU scores


Output tokens: two types of tokens

  • (1) IoU token \(\rightarrow\) predicts an IoU (confidence) score for each output mask
  • (2) Mask tokens \(\rightarrow\) decoded into the output masks
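
In terms of tensor shapes, the decoder's interface looks roughly like the shape-only sketch below (names are hypothetical; the real decoder is a small two-way transformer that cross-attends between the tokens and the image embedding, and the released model carries one extra mask token beyond the three shown).

```python
import torch

# Shape-only sketch of the mask decoder's inputs and outputs (hypothetical names).
image_embedding = torch.zeros(1, 256, 64, 64)   # from the image encoder
prompt_tokens   = torch.zeros(1, 2, 256)        # sparse prompt embeddings, N_tokens x 256

# Learned output tokens, prepended to the prompt tokens (analogous to [CLS]):
iou_token   = torch.zeros(1, 1, 256)            # -> one predicted IoU score per mask
mask_tokens = torch.zeros(1, 3, 256)            # -> decoded into the candidate masks

decoder_tokens = torch.cat([iou_token, mask_tokens, prompt_tokens], dim=1)  # (1, 6, 256)

# After token <-> image cross-attention, the decoder emits:
low_res_masks = torch.zeros(1, 3, 256, 256)     # later upsampled to the input resolution
iou_scores    = torch.zeros(1, 3)               # predicted IoU (confidence) per mask
```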

