Meta-Transformer: A Unified Framework for Multimodal Learning
Zhang, Yiyuan, et al. "Meta-transformer: A unified framework for multimodal learning." arXiv preprint arXiv:2307.10802 (2023).
References:
- https://aipapersacademy.com/meta-transformer/
- https://arxiv.org/pdf/2307.10802
Contents
- Introduction
- Architecture
- Unified Multimodal Transformer Pretraining
- Experiments
1. Introduction
Meta-Transformer
- A unified framework for multimodal learning
- Processes information from 12 different modalities (e.g., text, images, point clouds, audio, and video)

Challenge: each data modality is structured differently (e.g., an image is a 2D grid of pixels while audio is a 1D waveform), so a single shared model cannot consume them all directly.
2. Architecture
Goal: produce an embedding for input from any modality
- The same model can process different modalities as inputs

How can a transformer process information from different types of data?
→ Data-to-sequence tokenizer
- Consists of small per-modality tokenizers that convert raw data into a common token-sequence format
- The resulting token sequences are then processed by a single modality-shared transformer encoder

Task-specific head
- Maps the obtained representation to the output of each downstream task (see the sketch below)
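Putting the pieces together, the sketch below shows how a per-modality tokenizer, a shared transformer encoder, and a task-specific head could fit into one pipeline. This is a minimal illustration under assumptions, not the authors' code: the class names, dimensions, and the patch-based image tokenizer are all hypothetical choices for the example.

```python
# Minimal sketch of the Meta-Transformer pipeline (illustrative only).
import torch
import torch.nn as nn

class PatchImageTokenizer(nn.Module):
    """Hypothetical image tokenizer: splits an image into patches and
    projects each patch into the shared embedding dimension."""
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                     # (B, 3, H, W)
        tokens = self.proj(images)                 # (B, D, H/16, W/16)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, D)

class MetaTransformerSketch(nn.Module):
    """One tokenizer per modality, one shared encoder, one task head."""
    def __init__(self, tokenizers, embed_dim=768, depth=12, num_classes=10):
        super().__init__()
        self.tokenizers = nn.ModuleDict(tokenizers)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)  # task-specific head

    def forward(self, x, modality):
        tokens = self.tokenizers[modality](x)    # data-to-sequence
        features = self.encoder(tokens)          # modality-shared encoder
        return self.head(features.mean(dim=1))   # pooled -> task head

model = MetaTransformerSketch({"image": PatchImageTokenizer()})
logits = model(torch.randn(2, 3, 224, 224), modality="image")  # (2, 10)
```

Supporting a new modality then amounts to registering another small tokenizer; the encoder and its weights stay the same.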
3. Unified Multimodal Transformer Pretraining
(The paper does not share much detail about the pre-training process.)
Dataset: LAION-2B
- A large-scale dataset of (text, image) pairs
Task: contrastive learning on the paired text and image embeddings (a sketch follows below)
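Since the paper gives few details, here is a hedged sketch of what CLIP-style image-text contrastive pretraining typically looks like: embeddings of matched (image, text) pairs are pulled together while mismatched pairs within the batch are pushed apart. The function name, tensor shapes, and temperature value are illustrative assumptions, not the paper's specification.

```python
# Sketch of a symmetric InfoNCE (contrastive) loss over paired embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (B, D) embeddings of matched (image, text) pairs."""
    image_emb = F.normalize(image_emb, dim=-1)        # unit-norm embeddings
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))            # i-th image <-> i-th text
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Example usage with random embeddings:
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```

Each image's paired caption serves as its positive, and every other caption in the batch as a negative, and symmetrically for captions.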