Sapiens: Foundation for Human Vision Models

Khirodkar, Rawal, et al. "Sapiens: Foundation for human vision models." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.

References:

https://aipapersacademy.com/sapiens/
https://arxiv.org/pdf/2408.12569

Contents
1. Various tasks
2. Humans-300M
(1) Construction
(2) Statistics
(3) Dataset Comparison
3. SSL Pretraining
4. Task-specific Models
5. Experiments
(1) Reconstructed Results

1. Various tasks

Sapiens is a family of models from Meta AI that targets four fundamental human-centric vision tasks:

Pose Estimation: Detects the locations of the keypoints of the human body in the input image.
Body-part Segmentation: Assigns each pixel to the body part it belongs to.
Depth Estimation: Estimates the depth of each pixel.
- In the visualization: front (closer) = brighter, back (farther) = darker
Surface Normal Estimation: Estimates the surface orientation at each pixel, which describes the shape of the body.
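
The "front = brighter, back = darker" convention can be illustrated with a small sketch. It assumes that a smaller depth value means closer to the camera and that depth is min-max normalized over the image; the function name `depth_to_brightness` is hypothetical, and the normalization actually used for the Sapiens visualizations may differ.

```python
def depth_to_brightness(depth_rows):
    """Map a 2D list of depth values to 0-255 gray levels.

    Assumption: smaller depth = closer to the camera.
    Closest pixel -> 255 (brightest), farthest pixel -> 0 (darkest).
    """
    flat = [d for row in depth_rows for d in row]
    near, far = min(flat), max(flat)
    span = far - near
    if span == 0:
        # Flat depth map: every pixel is equally close, so render all as bright.
        return [[255 for _ in row] for row in depth_rows]
    # Invert and rescale so that brightness decreases as depth increases.
    return [[round(255 * (far - d) / span) for d in row] for row in depth_rows]


# Toy 2x2 depth map (values are illustrative only).
depth = [[1.0, 2.0], [3.0, 5.0]]
gray = depth_to_brightness(depth)
```

Here `gray[0][0]` (the closest pixel) receives the maximum brightness and `gray[1][1]` (the farthest pixel) the minimum.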

2. Humans-300M

(1) Construction

Curating a Human Images Dataset

(2) Statistics

(3) Dataset Comparison

3. SSL Pretraining

Pretraining task: Masked Autoencoder (MAE), i.e., reconstructing masked image patches from the visible ones
Encoder: Vision Transformer (ViT) architecture
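
The masking step of MAE pretraining can be sketched as follows. This is a minimal illustration, not the authors' implementation: the image size (224), patch size (16), mask ratio (0.75), and the function name `random_masking` are all assumptions chosen for the example. Only the visible patches would be passed to the ViT encoder; a decoder would then reconstruct the masked ones.

```python
import random


def random_masking(num_patches, mask_ratio, seed=0):
    """Randomly split patch indices into visible and masked sets."""
    rng = random.Random(seed)
    ids = list(range(num_patches))
    rng.shuffle(ids)
    # Keep the first (1 - mask_ratio) fraction of the shuffled patches.
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(ids[:num_keep])
    masked = sorted(ids[num_keep:])
    return visible, masked


# Illustrative setting: a 224x224 image cut into 16x16 patches -> 14 * 14 = 196 patches.
num_patches = (224 // 16) ** 2
visible, masked = random_masking(num_patches, mask_ratio=0.75)
```

With these illustrative settings, 49 patches stay visible and 147 are masked.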

4. Task-specific Models

Starting from the pretrained model:

Add a new task-specific decoder
For each task: train on a small labeled dataset

\(\rightarrow\) Repeat this for each of the 4 tasks.
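
The "one pretrained encoder, one decoder per task" setup can be sketched as a small configuration. Everything named here is hypothetical: `TaskHead`, `build_task_models`, the encoder label, and the keypoint and class counts are placeholders. The sketch also assumes that pose is predicted as one heatmap per keypoint; only the 1-channel depth and 3-channel normal outputs follow directly from the task definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskHead:
    name: str          # task identifier
    out_channels: int  # number of output channels of the decoder
    output: str        # what each output channel represents


def build_task_models(encoder_name, num_keypoints, num_part_classes):
    """Pair the same pretrained encoder with one decoder head per task."""
    heads = [
        TaskHead("pose", num_keypoints, "one heatmap per keypoint (assumed)"),
        TaskHead("segmentation", num_part_classes, "per-pixel class scores"),
        TaskHead("depth", 1, "per-pixel depth"),
        TaskHead("normal", 3, "per-pixel (x, y, z) normal vector"),
    ]
    return {head.name: (encoder_name, head) for head in heads}


# Placeholder values; the actual counts depend on the labeled datasets used.
models = build_task_models("pretrained-vit", num_keypoints=17, num_part_classes=20)
for task, (encoder, head) in models.items():
    print(f"{task}: {encoder} -> decoder with {head.out_channels} output channel(s)")
```

Each entry would then be trained separately on its own small labeled dataset.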

5. Experiments

(1) Reconstructed Results

MAE reconstructions of masked input images, used to judge pretraining quality.

Seunghan Lee