I-JEPA: The First Human-Like Computer Vision Model
Assran, Mahmoud, et al. "Self-supervised learning from images with a joint-embedding predictive architecture." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
References:
- https://aipapersacademy.com/i-jepa-a-human-like-computer-vision-model/
- https://arxiv.org/pdf/2301.08243
Contents
- Introduction
- SSL for Images
- I-JEPA
    - Introduction
    - Architecture
 
1. Introduction
I-JEPA
- Image-based Joint-Embedding Predictive Architecture
- Open-source computer vision model (from Meta AI)
- More human-like AI
2. SSL for Images
2 common approaches for SSL from images:
- (1) Invariance-based (e.g., contrastive learning, CL)
- (2) Generative (e.g., masked modeling, MM)


Comparison
| Aspect | Invariance-based (e.g., CL) | Generative (e.g., MM) |
|---|---|---|
| Semantic level | Learns high-level, semantic features (e.g., object categories) | Tends to learn lower-level features (e.g., pixel/patch details) |
| Prior knowledge | Relies on hand-crafted data augmentations, which introduce biases | Requires little prior knowledge beyond masking |
| Off-the-shelf evaluation (e.g., linear probing) | Strong | Weaker; typically needs fine-tuning |
| Generalization | Augmentation biases may not transfer to other tasks or modalities | More general recipe, but spends capacity on pixel-level details |
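
As a rough, generic illustration (not taken from the I-JEPA paper), the two families optimize very different objectives. The functions below are simplified textbook sketches: an InfoNCE-style contrastive loss for the invariance-based family, and a masked pixel-reconstruction loss for the generative family.

```python
# Generic sketch of the two SSL objective families (simplified, illustrative only).
import torch
import torch.nn.functional as F

def invariance_loss(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss: embeddings of two augmented views of the
    same image should be close; embeddings of different images should be far apart."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature           # (B, B) cosine-similarity matrix
    labels = torch.arange(z1.size(0))            # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

def generative_loss(pred_pixels, true_pixels, mask):
    """Masked-modeling loss: reconstruct the masked patches directly in pixel space."""
    return ((pred_pixels - true_pixels) ** 2)[mask].mean()
```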
3. I-JEPA
(1) Introduction
Goal: Improve the semantic level of the representations
- without relying on extra prior knowledge (e.g., hand-crafted data augmentations)
Main Idea: predict missing information in an abstract representation space (rather than in pixel space)
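
A worked form of this idea as the training objective (notation as in the paper: \(M\) target blocks, \(B_i\) the set of patch indices of block \(i\), \(\hat{s}_{y_j}\) the predicted and \(s_{y_j}\) the target-encoder representation of patch \(j\)); the same loss is restated in the Predictor section below:

\[
\mathcal{L} \;=\; \frac{1}{M} \sum_{i=1}^{M} \sum_{j \in B_i} \big\lVert \hat{s}_{y_j} - s_{y_j} \big\rVert_2^2
\]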

(2) Architecture
Patchify: the input image is split into non-overlapping patches
3 components
- (1) Context encoder
- (2) Target encoder
- (3) Predictor
\(\rightarrow\) Each of them is a separate Vision Transformer (ViT) model; a minimal skeleton is sketched below.
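
A minimal sketch of how the three components could be instantiated in PyTorch; `TinyViT` is a hypothetical stand-in (not the paper's backbones), and all hyperparameters are illustrative.

```python
# Minimal sketch: simplified stand-ins for the three ViT components (not the official code).
import copy
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Stand-in Vision Transformer: patchify an image, add positional embeddings, encode."""
    def __init__(self, img_size=224, patch_size=16, dim=192, depth=4, heads=4):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        # x: (B, 3, H, W) -> patch-level representations (B, N, dim)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos_embed
        return self.encoder(tokens)

context_encoder = TinyViT()
target_encoder = copy.deepcopy(context_encoder)   # same architecture, updated by EMA (see below)
for p in target_encoder.parameters():
    p.requires_grad = False                        # targets provide no gradients

# The predictor operates on token sequences (not pixels); a narrow transformer is used here.
pred_layer = nn.TransformerEncoderLayer(d_model=192, nhead=4, dim_feedforward=768, batch_first=True)
predictor = nn.TransformerEncoder(pred_layer, num_layers=2)
```

Freezing the target encoder's parameters reflects the stop-gradient design: it is only moved by the EMA update described at the end of this section.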

a) Target Encoder
- (Input) Sequence of patches
- (Output) Patch-level representations
- Sampling target blocks (see the sketch below)
    - Sample blocks of patch-level representations (possibly overlapping) \(\rightarrow\) these become the target blocks
    - Note that the targets live in representation space \(\rightarrow\) thus, each target is obtained by masking “after” the target encoder!
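
A minimal sketch of target-block sampling, reusing the `target_encoder` stand-in from the sketch above; block sizes and the number of blocks are illustrative (the paper samples blocks with random scales and aspect ratios).

```python
# Minimal sketch: sample rectangular target blocks in representation space.
import torch

def sample_block(grid_h, grid_w, block_h, block_w):
    """Return flattened patch indices of one rectangular block on the patch grid."""
    top = torch.randint(0, grid_h - block_h + 1, (1,)).item()
    left = torch.randint(0, grid_w - block_w + 1, (1,)).item()
    rows = torch.arange(top, top + block_h)
    cols = torch.arange(left, left + block_w)
    return (rows[:, None] * grid_w + cols[None, :]).flatten()

images = torch.randn(2, 3, 224, 224)                      # dummy batch
with torch.no_grad():
    all_target_reps = target_encoder(images)               # (B, 196, dim): encode ALL patches first
# "Mask after the target encoder": select blocks of the already-encoded patches.
target_idx = [sample_block(14, 14, 3, 4) for _ in range(3)]
targets = [all_target_reps[:, idx, :] for idx in target_idx]   # each: (B, 12, dim)
```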
 
b) Context Encoder
- (Input) Sequence of patches
- (Output) Patch-level representations
- Sampling context blocks (see the sketch below)
    - Significantly larger in size than the target blocks
    - Sampled independently from the target blocks \(\rightarrow\) there could be an overlap \(\rightarrow\) thus, the overlapping patches are removed from the context block!
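
A minimal sketch of context-block sampling, continuing the snippets above; the block size and the overlap-removal logic are illustrative.

```python
# Minimal sketch: sample a larger context block and drop patches that overlap any target block.
import torch

context_idx = sample_block(14, 14, 12, 12)                  # context block: much larger
target_union = torch.cat(target_idx)                         # indices covered by any target block
context_idx = context_idx[~torch.isin(context_idx, target_union)]

# The context encoder only sees the remaining (visible) context patches.
tokens = context_encoder.patch_embed(images).flatten(2).transpose(1, 2)
tokens = tokens + context_encoder.pos_embed
context_reps = context_encoder.encoder(tokens[:, context_idx, :])   # (B, |context|, dim)
```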
 

c) Predictor
- Predicts the three target block representations.
- For each target block representation, we feed the predictor with (see the sketch below):
    - (1) The output from the context encoder
    - (2) Mask tokens ( = a learnable vector plus positional embeddings that match the target block location )
- Loss: the average L2 distance between the predictions and the target block representations
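
A minimal sketch of the prediction step and loss, continuing the snippets above; the mask-token wiring is simplified (in the paper the predictor is itself a narrow ViT with its own positional embeddings).

```python
# Minimal sketch: predict each target block from the context representations + mask tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 192
mask_token = nn.Parameter(torch.zeros(1, 1, dim))             # shared learnable vector

loss = 0.0
for idx, tgt in zip(target_idx, targets):
    pos = context_encoder.pos_embed[:, idx, :]                 # positional embeddings of the target block location
    queries = (mask_token + pos).expand(context_reps.size(0), -1, -1)   # (B, block, dim)
    pred_in = torch.cat([context_reps, queries], dim=1)        # context tokens + mask tokens
    pred_out = predictor(pred_in)[:, -queries.size(1):, :]     # keep the mask-token outputs
    loss = loss + F.mse_loss(pred_out, tgt)                    # average (squared) L2 distance in representation space
loss = loss / len(targets)
```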
EMA
The target encoder's parameters are updated as an exponential moving average (EMA) of the context encoder's parameters (they receive no gradient updates).
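
A minimal sketch of the EMA update, continuing the snippets above; the momentum value is illustrative.

```python
# Minimal sketch: update the target encoder as an EMA of the context encoder (no gradients).
import torch

momentum = 0.996   # illustrative value
with torch.no_grad():
    for p_ctx, p_tgt in zip(context_encoder.parameters(), target_encoder.parameters()):
        p_tgt.mul_(momentum).add_(p_ctx, alpha=1.0 - momentum)
```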
