DALL-E; Zero-Shot Text-to-Image Generation

Ramesh, Aditya, et al. "Zero-shot text-to-image generation." International Conference on Machine Learning. PMLR, 2021.

Reference: https://www.youtube.com/watch?v=-e-vW1j132A&t=2257s


Contents

  1. Introduction
  2. DALL-E
  3. Visualization


1. Introduction

  • An autoregressive extension of GPT-3, with 12B parameters
  • CV + NLP $\rightarrow$ text-to-image generation
  • Strong zero-shot performance


2. DALL-E

Text & image tokens are modeled as a single stream of data.

Issues (with modeling pixels directly)

  • (1) Memory: a pixel-level context for high-resolution images is prohibitively long
  • (2) Short-range dependence: pixel-level likelihood objectives prioritize high-frequency detail over the global structure that makes images recognizable to us


(1) Stage 1

Generating the image tokens

  • (1) Uses a discrete VAE (dVAE)
  • (2) Compresses a (256x256) RGB image into a (32x32) grid of image tokens
  • (3) Codebook size: 8192 possible token values
  • (4) Reduces the transformer's context size by a factor of 192
    • (256x256x3) $\rightarrow$ (32x32), i.e., 196,608 values $\rightarrow$ 1,024 tokens


Far more efficient than generating pixels one by one!

  • Even though fine details may be lost, the key visual elements are well preserved!
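To make the shapes concrete, here is a minimal sketch of the tokenization step, assuming a hypothetical `ToyDVAEEncoder` (the real dVAE is a much larger convolutional model); only the 8192-way codebook and the (256x256) $\rightarrow$ (32x32) shapes follow the paper.

```python
import torch
import torch.nn as nn

# Toy sketch of the Stage-1 idea (hypothetical architecture, NOT the
# paper's actual dVAE): an encoder maps a 256x256 RGB image to a 32x32
# grid of logits over a codebook of 8192 visual tokens.

CODEBOOK_SIZE = 8192

class ToyDVAEEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # three stride-2 convs downsample 256 -> 128 -> 64 -> 32
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, CODEBOOK_SIZE, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)  # (B, 8192, 32, 32) logits over the codebook

encoder = ToyDVAEEncoder()
image = torch.randn(1, 3, 256, 256)    # a dummy RGB image
tokens = encoder(image).argmax(dim=1)  # (1, 32, 32) discrete token ids
print(tokens.shape)                    # torch.Size([1, 32, 32])
# 256*256*3 = 196,608 values -> 32*32 = 1,024 tokens: the 192x reduction
```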



(2) Stage 2

Combining text tokens & image tokens

Concatenate (a) & (b)

  • (a) up to 256 BPE-encoded text tokens
  • (b) (32x32) = 1024 image tokens

$\rightarrow$ The model learns the joint distribution over text & image tokens.


This sampling process repeats autoregressively until all 1024 image tokens have been filled in, as in the sketch below!
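A minimal sketch of that loop, assuming a hypothetical `transformer` with a GPT-style interface that returns next-token logits for a token sequence:

```python
import torch

IMAGE_LEN = 1024  # 32x32 image tokens to generate

def generate_image_tokens(transformer, text_tokens):
    """Autoregressively append 1024 image tokens to the caption tokens."""
    seq = text_tokens                             # (1, <=256) BPE text tokens
    for _ in range(IMAGE_LEN):
        logits = transformer(seq)                 # (1, seq_len, vocab)
        probs = logits[:, -1].softmax(dim=-1)     # distribution over next token
        next_tok = torch.multinomial(probs, 1)    # sample one image token
        seq = torch.cat([seq, next_tok], dim=1)
    # the last 1024 tokens form the 32x32 grid fed to the dVAE decoder
    return seq[:, -IMAGE_LEN:].reshape(1, 32, 32)
```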



(3) Interpretation

$p_{\theta, \psi}(x, y, z)=p_\theta(x \mid y, z) p_\psi(y, z)$.

  • (1) $p_\psi(y, z)$: Transformer
  • (2) $p_\theta(x \mid y, z)$: Discrete VAE decoder
  • Notation
    • $x$: image
    • $y$: text ( = caption )
    • $z$: image tokens
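Marginalizing out the image tokens $z$ recovers the image–caption likelihood that the ELBO below lower-bounds:

$p_{\theta, \psi}(x, y)=\sum_z p_\theta(x \mid y, z)\, p_\psi(y, z)$.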


Maximizing ELBO

$\ln p_{\theta, \psi}(x, y) \geq \underset{z \sim q_\phi(z \mid x)}{\mathbb{E}}\left(\ln p_\theta(x \mid y, z)-\beta D_{\mathrm{KL}}\left(q_\phi(y, z \mid x), p_\psi(y, z)\right)\right)$.

  • $p_\theta$: dVAE decoder ( predicts the image from the image tokens )
  • $q_\phi$: dVAE encoder ( predicts the image tokens from the image )
  • $p_\psi$: Transformer ( models the joint distribution of image & text tokens )


(4) Details

Stage 1. Learning the visual codebook

  • Train the dVAE encoder & decoder
    • i.e., maximize the ELBO with respect to $\theta, \phi$ (sketched below)
  • Uses only the image part of the data (no text)
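A toy sketch of this Stage-1 objective, assuming hypothetical `encoder`/`decoder` modules with the shapes above; the real model uses a logit-Laplace reconstruction term and an annealed Gumbel-softmax relaxation, which this simplifies to MSE and a fixed temperature. In Stage 1 the prior over tokens is uniform, hence the KL term below.

```python
import torch
import torch.nn.functional as F

def dvae_loss(encoder, decoder, image, beta=6.6, tau=1.0):
    # beta: KL weight (the paper anneals this toward 6.6); tau: relaxation temp
    logits = encoder(image)                        # (B, 8192, 32, 32) token logits
    # Gumbel-softmax relaxation keeps discrete token sampling differentiable
    soft_onehot = F.gumbel_softmax(logits, tau=tau, dim=1)
    recon = decoder(soft_onehot)                   # (B, 3, 256, 256) reconstruction
    recon_loss = F.mse_loss(recon, image)          # stand-in for -ln p_theta(x|z)

    # KL between q_phi(z|x) and a uniform prior over the 8192 codes
    log_q = F.log_softmax(logits, dim=1)
    log_uniform = -torch.log(torch.tensor(8192.0))
    kl = (log_q.exp() * (log_q - log_uniform)).sum(dim=1).mean()
    return recon_loss + beta * kl
```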


Stage 2. Learning the prior

  • Train the Transformer
    • i.e., with $\theta, \phi$ fixed, maximize the ELBO with respect to $\psi$ (sketched below)
  • Learns the prior over text & image tokens
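A sketch of the Stage-2 objective, assuming text and image tokens share one vocabulary and the same hypothetical GPT-style `transformer` as above; with the dVAE frozen, this amounts to next-token cross-entropy over the concatenated sequence.

```python
import torch
import torch.nn.functional as F

def prior_loss(transformer, text_tokens, image_tokens):
    # Concatenate caption tokens and (frozen dVAE-encoded) image tokens
    seq = torch.cat([text_tokens, image_tokens], dim=1)  # (B, 256 + 1024)
    logits = transformer(seq[:, :-1])                    # predict each next token
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),             # (B*(L-1), vocab)
        seq[:, 1:].reshape(-1),                          # shifted targets
    )
```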


Transformer: decoder-only model

  • (1) Text-to-text attention: standard causal mask
  • (2) Image-to-image attention: row, column, or convolutional attention masks (a toy row mask is sketched below)
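As a toy illustration of one of these sparse patterns, the snippet below builds a causal "row" mask over the 1024 image positions, where each image token attends only to earlier tokens in its own row. The paper's actual masks also let image tokens attend to all text tokens, which is omitted here.

```python
import torch

GRID = 32  # the image tokens form a 32x32 grid

def row_attention_mask(grid=GRID):
    n = grid * grid
    idx = torch.arange(n)
    same_row = (idx[:, None] // grid) == (idx[None, :] // grid)
    causal = idx[:, None] >= idx[None, :]   # no attending to future tokens
    return same_row & causal                # (1024, 1024) boolean mask

mask = row_attention_mask()
print(mask.shape)  # torch.Size([1024, 1024])
```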



3. Visualization

(figure omitted)
