All about “SAM”
(Reference: https://www.youtube.com/watch?v=eYhvJR4zFUM)
Contents
- Image Segmentation
- Introduction to Segment Anything (SAM)
- Promptable Segmentation Task
- Segment Anything Model
- Segment Anything Dataset
- [A] Promptable Segmentation Task
- [B] Segment Anything Model
- Image Encoder
- Prompt Encoder
- Mask Decoder
1. Image Segmentation
Definition
Process of partitioning a digital image into multiple regions (or segments)
- Pixels belonging to the same region share some (semantic) characteristics
Challenges
- Difficult & Expensive to label
- Models are usually application-specific
- e.g., a model trained for medical imaging does not transfer to pedestrian detection
- Previous models are usually not promptable
- e.g., can’t tell the model to only segment “people”
2. Introduction to Segment Anything (SAM)
Three innovations
- Promptable Segmentation Task
- Segment Anything Model
- Segment Anything Dataset (and its Segment Anything Engine)
(1) Promptable Segmentation Task
Allows finding masks given a prompt of … (see the code sketch after this list)
- (1) Points (e.g., mouse click)
- (2) Boxes (e.g., rectangle defined by user)
- (3) Text prompts (e.g., “find all dogs”)
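A minimal sketch of point and box prompting with the public `segment_anything` package (the checkpoint filename and image path below are illustrative; the model weights must be downloaded separately). Text prompts are described as exploratory in the paper and are not shown here.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # illustrative path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)      # HxWx3 uint8, RGB
predictor.set_image(image)  # the heavy image encoder runs here, once per image

# (1) Point prompt: a single foreground click at pixel (x=500, y=375)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),          # 1 = foreground, 0 = background
)

# (2) Box prompt: rectangle given as [x0, y0, x1, y1]
masks, scores, _ = predictor.predict(box=np.array([100, 100, 600, 500]))
```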
(2) Segment Anything Model
- (1) Fast encoder-decoder model
- (2) Ambiguity-aware
- e.g., Given a point … it may correspond to (a) or (b) or (c)
- (a) Part
- (b) Subpart
- (c) Whole
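The ambiguity-aware behavior shows up directly in the same interface: with `multimask_output=True` (the default), the predictor returns three candidate masks, roughly corresponding to whole / part / subpart, each with a predicted quality score. Continuing the `predictor` and `np` from the sketch above:

```python
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,                # return 3 candidate masks instead of 1
)
best_mask = masks[np.argmax(scores)]      # keep the highest-scoring interpretation
```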
(3) Segment Anything Dataset
- (1) 1.1 billion segmentation masks
- Collected with the Segment Anything Engine
- (2) No human supervision
- All the masks have been generated automatically!
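The automatic pipeline behind this is exposed as `SamAutomaticMaskGenerator`, which prompts the model with a grid of points and then filters and deduplicates the results. A minimal sketch using default parameters (not the exact engine settings used to build the SA-1B dataset; paths are illustrative):

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # illustrative path
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)      # HxWx3 uint8, RGB
masks = mask_generator.generate(image)
# Each entry is a dict with the binary mask plus metadata such as the mask's
# area and the model's own predicted quality.
print(len(masks), masks[0]["area"], masks[0]["predicted_iou"])
```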
3. [A] Promptable Segmentation Task
Pretraining task of a foundation model …
- for NLP: Next token prediction
- for CV: Promptable Segmentation Task
Goal: Return a “valid” segmentation mask given any prompt
What is a “valid” mask?
\(\rightarrow\) Even when the prompt is ambiguous, the output should be a reasonable mask for at least one of the objects the prompt could refer to!
4. [B] Segment Anything Model
(1) Image Encoder
- MAE pretrained ViT
- Applied prior to prompting the model!
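A shape-level sketch of “applied prior to prompting”: the ViT runs once per image and its embedding is reused for every prompt that follows, so prompting itself stays cheap. Assuming the default 1024x1024 input resolution and an already preprocessed (resized + normalized) tensor; the checkpoint filename is illustrative:

```python
import torch
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # illustrative path

with torch.no_grad():
    x = torch.zeros(1, 3, 1024, 1024)        # preprocessed image batch
    image_embedding = sam.image_encoder(x)   # heavy step, run once per image

print(image_embedding.shape)  # torch.Size([1, 256, 64, 64])
```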
(2) Prompt Encoder
a) Two sets of prompts
(1) Sparse (points, boxes, text)
- 1-a) Points & Boxes: Represent by (1) + (2)
- (1) Positional encodings (PE)
- (2) Learned embeddings for each prompt type
- 1-b) Text: Represent with text encoder from CLIP
(2) Dense (masks)
- Embedded using convolutions
- Summed element-wise with image embedding
b) Details of Sparse prompts
Mapped to 256-dim
- Point: (1) + (2)
- (1) PE of the point’s location
- (2) One of two learned embeddings
- indicating whether the point is in the foreground or the background
- Box: Embedding pair ((1),(2))
- (1) PE of “top-left corner” + Learned embedding of “top-left corner”
- (2) PE of “bottom-right corner” + Learned embedding of “bottom-right corner”
- Text: embedding vector from text encoder of CLIP
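A toy sketch of how a point prompt becomes a 256-dim vector: the positional encoding of its location plus one of two learned foreground/background embeddings. The Fourier-feature positional encoding below is a stand-in, not SAM's exact module:

```python
import torch
import torch.nn as nn

EMBED_DIM = 256

class PointPromptEncoder(nn.Module):
    """Toy version of sparse point encoding: PE(location) + label embedding."""
    def __init__(self):
        super().__init__()
        self.label_embed = nn.Embedding(2, EMBED_DIM)                   # 0 = background, 1 = foreground
        self.register_buffer("freqs", torch.randn(2, EMBED_DIM // 2))   # stand-in PE frequencies

    def positional_encoding(self, xy):          # xy normalized to [0, 1], shape (N, 2)
        proj = 2 * torch.pi * xy @ self.freqs   # (N, EMBED_DIM // 2)
        return torch.cat([proj.sin(), proj.cos()], dim=-1)

    def forward(self, xy, labels):
        return self.positional_encoding(xy) + self.label_embed(labels)

enc = PointPromptEncoder()
point_embedding = enc(torch.tensor([[0.5, 0.4]]), torch.tensor([1]))
print(point_embedding.shape)  # torch.Size([1, 256])
```

A box would correspond to a pair of such vectors, one per corner, each with its own learned corner embedding.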
c) Details of Dense prompts
Dense prompts
- Have a spatial correspondence with the image
Downscale the input masks
- Step 1) Input masks at “4x lower resolution” than the input image
- Step 2) Downscale an additional “4x” (two 2x2, stride-2 convolutions)
\(\rightarrow\) Output channels of 4, then 16
- Step 3) Into 256 channels (with 1x1 convolution)
\(\rightarrow\) Matches the 256x64x64 image embedding, so the two can be summed element-wise (see the sketch below)
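A shape-level sketch of this downscaling path in PyTorch (layer sizes follow the steps above; normalization layers and other details of SAM's actual implementation are omitted):

```python
import torch
import torch.nn as nn

# Step 1) The mask prompt enters at 4x lower resolution than the 1024x1024 image.
mask = torch.zeros(1, 1, 256, 256)

dense_encoder = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=2, stride=2),    # Step 2) 256x256 -> 128x128, 4 channels
    nn.GELU(),
    nn.Conv2d(4, 16, kernel_size=2, stride=2),   #         128x128 -> 64x64, 16 channels
    nn.GELU(),
    nn.Conv2d(16, 256, kernel_size=1),           # Step 3) 1x1 conv -> 256 channels
)

dense_embedding = dense_encoder(mask)            # (1, 256, 64, 64)
image_embedding = torch.zeros(1, 256, 64, 64)    # placeholder for the image encoder output
fused = image_embedding + dense_embedding        # element-wise sum with the image embedding
```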
d) Positional Encodings
(3) Mask Decoder
Lightweight mask decoder
- [Input] Image embedding & Prompt Embeddings
- Image embedding: 256x64x64
- Prompt Embeddings (with Output tokens): \(N_{\text{tokens}}\) x 256
- Output tokens: learned tokens, analogous to the [CLS] token
- [Output] Output mask & IoU scores
Output tokens: two types of tokens
- (1) For IoU
- (2) For Mask
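A sketch of the decoder's interface at the tensor-shape level, using the modules of the public `segment_anything` package (checkpoint filename illustrative; the image embedding is a zero placeholder standing in for a real encoder output):

```python
import torch
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # illustrative path

image_embedding = torch.zeros(1, 256, 64, 64)   # would come from sam.image_encoder

# One foreground point, turned into sparse/dense prompt embeddings.
coords = torch.tensor([[[512.0, 384.0]]])       # (B, N, 2) pixel coordinates
labels = torch.tensor([[1]])                    # 1 = foreground
sparse, dense = sam.prompt_encoder(points=(coords, labels), boxes=None, masks=None)

low_res_masks, iou_predictions = sam.mask_decoder(
    image_embeddings=image_embedding,
    image_pe=sam.prompt_encoder.get_dense_pe(),
    sparse_prompt_embeddings=sparse,            # (1, N_tokens, 256)
    dense_prompt_embeddings=dense,              # (1, 256, 64, 64)
    multimask_output=True,
)
print(low_res_masks.shape, iou_predictions.shape)  # e.g. (1, 3, 256, 256) and (1, 3)
```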