Meta-Transformer: A Unified Framework for Multimodal Learning

Zhang, Yiyuan, et al. "Meta-transformer: A unified framework for multimodal learning." arXiv preprint arXiv:2307.10802 (2023).

References:

  • https://aipapersacademy.com/meta-transformer/
  • https://arxiv.org/pdf/2307.10802


Contents

  1. Introduction
  2. Architecture
  3. Unified Multimodal Transformer Pretraining
  4. Experiments


1. Introduction

Meta-Transformer

  • A unified framework for multimodal learning
  • Processes information from 12 different modalities (e.g., text, images, point clouds, audio, video, tabular data, graphs, time series)

(figure)


Challenge: each data modality is structured differently (images are 2D pixel grids, point clouds are unordered 3D coordinates, audio is a 1D waveform), so a single model cannot consume them all directly without a conversion step.


2. Architecture

(figure)

Goal: produce embeddings for data from any of the supported modalities

  • A single modality-shared encoder processes the different modalities as inputs


How can a single transformer process information from different types of data?

\(\rightarrow\) Data-to-sequence tokenizer

  • Consists of small per-modality tokenizers that convert raw data into token sequences of a shared embedding dimension (a minimal sketch follows the figure below)

(figure)
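
As an illustration, the sketch below shows what per-modality tokenizers can look like; the class names and hyperparameters are hypothetical assumptions, not the paper's actual implementation. Each tokenizer maps raw data into a token sequence of the same embedding dimension.

```python
# Minimal sketch of data-to-sequence tokenizers (hypothetical names).
import torch
import torch.nn as nn

EMBED_DIM = 768  # shared token dimension expected by the unified encoder

class ImageTokenizer(nn.Module):
    """Split an image into patches and project each patch to EMBED_DIM."""
    def __init__(self, patch_size=16, in_channels=3, embed_dim=EMBED_DIM):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                # (B, 3, H, W)
        x = self.proj(images)                 # (B, D, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, D)

class TextTokenizer(nn.Module):
    """Map token ids to EMBED_DIM embeddings."""
    def __init__(self, vocab_size=30522, embed_dim=EMBED_DIM):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, token_ids):             # (B, seq_len)
        return self.embed(token_ids)          # (B, seq_len, D)

# Both tokenizers output (batch, sequence, EMBED_DIM), so one
# modality-shared Transformer encoder can consume either sequence.
```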


Task-specific head

  • A lightweight head (e.g., an MLP) attached on top of the shared representation to solve each downstream task
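
One example of such a head is a simple classification head. The sketch below (hypothetical names, PyTorch assumed) pools the encoder's token sequence and maps it to class logits; in the paper's setup the shared encoder stays frozen while the per-modality tokenizer and the head are trained.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Pool the encoder's token sequence and map it to class logits."""
    def __init__(self, embed_dim=768, num_classes=1000):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):              # (B, seq_len, D) from the shared encoder
        pooled = tokens.mean(dim=1)         # average-pool over the sequence
        return self.fc(self.norm(pooled))   # (B, num_classes)

# Full pipeline: tokens = tokenizer(raw_data)
#                feats  = shared_encoder(tokens)   # frozen
#                logits = head(feats)
```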


3. Unified Multimodal Transformer Pretraining

(The paper provides only limited detail about the pre-training process.)

Dataset: LAION-2B

  • A large-scale dataset of (text, image) pairs

Task: contrastive learning over matching (text, image) pairs

(figure)
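
Since the paper does not spell out its recipe, the sketch below shows a generic CLIP-style symmetric contrastive (InfoNCE) loss over a batch of matching (image, text) embeddings; the function name and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matching (image, text) pairs."""
    image_emb = F.normalize(image_emb, dim=-1)        # (B, D)
    text_emb = F.normalize(text_emb, dim=-1)          # (B, D)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs lie on the diagonal; pull them together, push others apart.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```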


4. Experiments

(figure)


(1) Overall Performance

(figure)


(2) Text

(figure)


(3) Image

(figure)