DALL-E: Zero-Shot Text-to-Image Generation

Ramesh, Aditya, et al. "Zero-shot text-to-image generation." International Conference on Machine Learning. PMLR, 2021.

Reference: https://www.youtube.com/watch?v=-e-vW1j132A&t=2257s


Contents

  1. Introduction
  2. DALL-E
  3. Visualization


1. Introduction

  • An auto-regressive extension of GPT-3, with 12B params.
  • CV + NLP \(\rightarrow\) the text-to-image task
  • Zero-shot performance


2. DALL-E

Models text & image tokens as a single stream

Issues with modeling pixels directly

  • (1) Memory blows up at the pixel level
  • (2) Pixel-level likelihoods favor short-range dependencies over the global structure that matters


(1) Stage 1

Generating image tokens

  • (1) Uses a discrete VAE (dVAE)
  • (2) Compresses a (256x256) image into a (32x32) grid of image tokens
  • (3) Codebook size (# of tokens): 8192
  • (4) Shrinks the context size by a factor of 192
    • (256x256x3) \(\rightarrow\) (32x32), i.e., 196,608 values \(\rightarrow\) 1,024 tokens


More efficient than generating pixels one at a time!

  • Even if fine details are lost, the key visual elements are preserved well!
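
To make the shapes concrete, here is a minimal PyTorch sketch of the tokenization step. The module and layer sizes are hypothetical (the paper's dVAE is a much deeper network trained with a Gumbel-softmax relaxation, covered below); only the input/output shapes and the 8192-way codebook follow the paper.

```python
import torch
import torch.nn as nn

VOCAB = 8192  # codebook size

class ToyDVAEEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # three stride-2 convs downsample 256x256 -> 32x32 (factor of 8)
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, VOCAB, 4, stride=2, padding=1),  # logits over the 8192 codes
        )

    def forward(self, img):              # img: (B, 3, 256, 256)
        logits = self.net(img)           # (B, 8192, 32, 32)
        return logits.argmax(dim=1)      # (B, 32, 32) discrete image tokens

encoder = ToyDVAEEncoder()
tokens = encoder(torch.rand(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 32, 32]) -> 1024 tokens, each in [0, 8192)
```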



(2) Stage 2

Combining text tokens & image tokens

Concatenate (a) & (b)

  • (a) up to 256 BPE-encoded text tokens
  • (b) (32x32) = 1024 image tokens

\(\rightarrow\) Learns the joint distribution of text & image tokens


This step is repeated autoregressively until all 1024 image tokens are filled (see the sketch below)!
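
A minimal sketch of this generation loop, assuming a hypothetical `transformer` callable that maps a token sequence to per-position next-token logits. The paper samples stochastically and reranks candidates; greedy decoding is used here just for brevity.

```python
import torch

TEXT_LEN, IMG_LEN = 256, 32 * 32  # 256 text slots, 1024 image token slots

@torch.no_grad()
def generate_image_tokens(transformer, text_tokens):
    """text_tokens: (B, <=256) BPE ids -> (B, 1024) image token ids."""
    seq = text_tokens                                      # the stream starts with the caption
    for _ in range(IMG_LEN):                               # fill all 1024 image slots
        logits = transformer(seq)                          # (B, T, vocab)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)  # greedy pick, for brevity
        seq = torch.cat([seq, next_tok], dim=1)            # append and repeat
    return seq[:, -IMG_LEN:]                               # the generated image tokens
```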



(3) Interpretation

\(p_{\theta, \psi}(x, y, z)=p_\theta(x \mid y, z)\, p_\psi(y, z)\).

  • (1) \(p_\psi(y, z)\): Transformer
  • (2) \(p_\theta(x \mid y, z)\): Discrete VAE decoder
  • Notation
    • \(x\): image
    • \(y\): text ( = caption )
    • \(z\): image tokens


Maximizing ELBO

\(\ln p_{\theta, \psi}(x, y) \geq \underset{z \sim q_\phi(z \mid x)}{\mathbb{E}}\left(\ln p_\theta(x \mid y, z)-\beta D_{\mathrm{KL}}\left(q_\phi(y, z \mid x),\, p_\psi(y, z)\right)\right)\).

  • \(p_{\theta}\): dVAE decoder (reconstructs the image from the image tokens)
  • \(q_{\phi}\): dVAE encoder (predicts the image tokens from the image)
  • \(p_\psi\): Transformer (models the joint distribution of image tokens & text tokens)


(4) Details

Stage 1. Learning the visual codebook

  • Trains the dVAE encoder & decoder
    • i.e., maximizes the ELBO w.r.t. \(\theta, \phi\)
  • Involves only the image (no text); a minimal sketch of the objective follows below
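
A minimal sketch of this Stage-1 objective, assuming hypothetical `encoder`/`decoder` modules with the shapes from Stage 1. The paper uses a logit-Laplace reconstruction term and anneals both the relaxation temperature and the KL weight; plain MSE and fixed hyperparameters are stand-ins here.

```python
import math
import torch
import torch.nn.functional as F

def dvae_loss(encoder, decoder, img, beta=1.0, tau=1.0):
    """One training step: reconstruction + beta * KL, i.e., the (negated) ELBO."""
    logits = encoder(img)                           # (B, 8192, 32, 32)
    # Gumbel-softmax relaxation: differentiable stand-in for sampling z ~ q_phi(z|x)
    soft_tokens = F.gumbel_softmax(logits, tau=tau, dim=1)
    recon = decoder(soft_tokens)                    # (B, 3, 256, 256)
    recon_loss = F.mse_loss(recon, img)             # stand-in for -ln p_theta(x|z)
    # KL( q_phi(z|x) || uniform prior over the 8192 codes )
    log_q = F.log_softmax(logits, dim=1)
    kl = (log_q.exp() * (log_q + math.log(8192))).sum(dim=1).mean()
    return recon_loss + beta * kl
```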


Stage 2. Learning the prior

  • Trains the Transformer
    • i.e., maximizes the ELBO w.r.t. \(\psi\), with \(\theta, \phi\) held fixed
  • Learns the prior over text & image tokens (see the sketch below)
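
Since \(\theta, \phi\) are frozen, only the \(-\mathbb{E}_q[\ln p_\psi(y, z)]\) part of the KL term depends on \(\psi\), so Stage 2 reduces to standard next-token cross-entropy on the concatenated stream. A minimal sketch, assuming a hypothetical `transformer` with a shared text+image vocabulary (image token ids offset past the text ids):

```python
import torch
import torch.nn.functional as F

def prior_loss(transformer, text_tokens, image_tokens):
    """Next-token cross-entropy over the concatenated single stream."""
    stream = torch.cat([text_tokens, image_tokens], dim=1)  # (B, 256 + 1024)
    logits = transformer(stream[:, :-1])                    # predict token t from tokens < t
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),                # (B*(T-1), vocab)
        stream[:, 1:].reshape(-1),                          # shifted targets
    )
```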


Transformer: decoder-only model

  • (1) Text-to-text attention: standard causal mask
  • (2) Image-to-image attention: row/column/convolutional attention masks (a simplified sketch of the row mask follows)
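
A simplified sketch of building one such mask (the row variant). This is an assumption-level illustration: image tokens attend to all text tokens (which are always in the past) and causally within their own row; the paper's exact row/column/convolutional patterns are specified in its appendix.

```python
import torch

def row_attention_mask(text_len=256, rows=32, cols=32):
    """Boolean (T, T) mask; True = query may attend to that key."""
    T = text_len + rows * cols
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))   # text rows: standard causal
    for r in range(rows):
        for c in range(cols):
            q = text_len + r * cols + c                      # position of image token (r, c)
            allowed = torch.zeros(T, dtype=torch.bool)
            allowed[:text_len] = True                        # attend to all text tokens
            allowed[text_len + r * cols : q + 1] = True      # same row, causally
            mask[q] = allowed
    return mask
```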



3. Visualization

