Sapiens: Foundation for Human Vision Models

Khirodkar, Rawal, et al. "Sapiens: Foundation for human vision models." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.

References:

  • https://aipapersacademy.com/sapiens/
  • https://arxiv.org/pdf/2408.12569


Contents

  1. Various tasks
  2. Humans-300M
    1. Construction
    2. Statistics
    3. Dataset Comparison
  3. SSL Pretraining
  4. Task-specific Models
  5. Experiments


1. Various tasks

Sapiens: Foundation for Human Vision Models

  • Family of models that target four fundamental human-centric tasks
  • by Meta AI


figure2

  • Pose Estimation: Detects the location of key points of the human body in the input image.
  • Body-part Segmentation: Determines which pixels belong to each body part.
  • Depth Estimation: Estimates the depth of each pixel.
    • In the visualization: front (near) = brighter, back (far) = darker
  • Surface Normal Estimation: Estimates the surface orientation (normal vector) at each pixel, which describes the shape of the object.
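
The visualization conventions for the two dense outputs can be sketched in a few lines. This is a minimal illustration, not the paper's rendering code: `depth_to_brightness` and `normal_to_rgb` are hypothetical helper names, the min-max normalization is an assumption, and the normal-to-RGB mapping is simply the common convention of shifting each component from [-1, 1] to [0, 255].

```python
def depth_to_brightness(depths):
    """Map depths to 0-255 gray levels so that nearer pixels are brighter."""
    near, far = min(depths), max(depths)
    if far == near:
        # A flat depth map carries no ordering; render it uniformly bright.
        return [255] * len(depths)
    # Assumption: linear min-max normalization, inverted so near -> 255.
    return [round(255.0 * (far - d) / (far - near)) for d in depths]


def normal_to_rgb(normal):
    """Map a unit normal with components in [-1, 1] to an RGB triple in 0-255."""
    return tuple(round((c + 1.0) / 2.0 * 255.0) for c in normal)


print(depth_to_brightness([0.0, 1.0, 5.0]))  # nearest pixel gets the largest value
print(normal_to_rgb((-1.0, 1.0, 1.0)))
```

A real depth or normal map is a 2D array per image; flat lists are used here only to keep the sketch dependency-free.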


2. Humans-300M

(1) Construction

Curating a Human Images Dataset

figure2


(2) Statistics

figure2


(3) Dataset Comparison

figure2


3. SSL Pretraining

figure2

  • Pretraining task: MAE (Masked Autoencoder), which reconstructs masked image patches from the visible ones
  • Encoder: Vision Transformer (ViT) architecture
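
The core of MAE pretraining is random patch masking plus a reconstruction loss computed only on the masked patches. The sketch below illustrates that idea in plain Python. The function names are hypothetical, and the 14×14 patch grid and 75% mask ratio in the usage lines are illustrative assumptions, not values taken from these notes.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Split patch indices into (visible, masked) sets, MAE-style."""
    rng = random.Random(seed)
    ids = list(range(num_patches))
    rng.shuffle(ids)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(ids[:num_masked])
    visible = sorted(ids[num_masked:])  # only these are fed to the encoder
    return visible, masked


def masked_mse(pred, target, masked_ids):
    """Mean squared error over the pixels of the masked patches only."""
    if not masked_ids:
        raise ValueError("masked_ids must not be empty")
    total, count = 0.0, 0
    for i in masked_ids:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


# Assumed setup: a 14 x 14 grid of patches and a 75% mask ratio.
visible, masked = random_patch_mask(num_patches=14 * 14, mask_ratio=0.75)
print(len(visible), len(masked))
```

In a full pipeline the encoder (ViT) sees only the visible patches, and a lightweight decoder predicts the pixels of the masked ones. Here `pred` and `target` are just per-patch lists of pixel values.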


4. Task-specific Models

Using the pretrained model:

  • Add a new task-specific decoder on top of the pretrained encoder
  • Fine-tune on a small labeled dataset for each task

\(\rightarrow\) Repeat this for each of the 4 tasks.
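
The "one shared encoder, one decoder per task" pattern can be sketched as follows. This is only a structural illustration under stated assumptions: `LinearHead` is a hypothetical stand-in (a per-token linear map) for the real decoders, which these notes do not describe, and the keypoint and class counts in the usage lines are placeholders. Only the depth (1 channel) and normal (3 channels) output sizes follow directly from the task definitions.

```python
import random


def head_channels(num_keypoints, num_classes):
    """Output channels each task-specific decoder must produce."""
    return {
        "pose": num_keypoints,        # one heatmap per keypoint
        "segmentation": num_classes,  # one score map per body-part class
        "depth": 1,                   # one depth value per location
        "normal": 3,                  # (x, y, z) normal per location
    }


class LinearHead:
    """Hypothetical stand-in for a task decoder: a per-token linear map."""

    def __init__(self, in_dim, out_dim, rng):
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)
        ]

    def __call__(self, tokens):
        return [
            [sum(w * x for w, x in zip(row, tok)) for row in self.weights]
            for tok in tokens
        ]


def build_task_heads(in_dim, num_keypoints, num_classes, seed=0):
    """Create one head per task; all heads consume the same encoder features."""
    rng = random.Random(seed)
    channels = head_channels(num_keypoints, num_classes)
    return {task: LinearHead(in_dim, out, rng) for task, out in channels.items()}


# Placeholder sizes: 8-dim features, 17 keypoints, 20 classes (all assumed).
heads = build_task_heads(in_dim=8, num_keypoints=17, num_classes=20)
features = [[0.5] * 8 for _ in range(4)]  # stand-in for encoder output tokens
outputs = {task: head(features) for task, head in heads.items()}
print({task: (len(out), len(out[0])) for task, out in outputs.items()})
```

Each task then gets its own loss and its own labeled dataset, while the pretrained encoder weights serve as the common starting point.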


5. Experiments

(1) Reconstructed Results

Reconstructed results (Pretraining quality)

figure2

