CodeFusion: A Pre-trained Diffusion Model for Code Generation
Singh, Mukul, et al. "Codefusion: A pre-trained diffusion model for code generation." EMNLP 2023
References:
- https://aipapersacademy.com/codefusion/
- https://aclanthology.org/2023.emnlp-main.716.pdf
 
Contents
- Human vs. Model (in coding)
- Recap of Diffusion Models
- CodeFusion Architecture
- Training Process
  - Phase 1: Unsupervised pretraining
  - Phase 2: Supervised fine-tuning
- Experiments
1. Human vs. Model (in coding)
- Human: Often reaches a point where they decide to rewrite some piece of code from scratch.
- Model: Has one chance to get the implementation right.

\(\rightarrow\) The model has no easy way to reconsider tokens it has already generated.
\(\rightarrow\) CodeFusion tackles this limitation by letting the model **revise its implementation** over multiple iterations.


2. Recap of Diffusion models
CodeFusion is a diffusion model for code generation, so let us first recap how diffusion models work.
- Backbone of the top text-to-image generation models
  ( e.g., DALL-E, Stable Diffusion, Imagen ... )
- Input) Prompt: “A cat is sitting on a laptop”.
- Process) Gradually remove noise from an image ( sketched below )
  - Step 1) Start with a random noise image
  - Step 2) At each step, remove some of the noise
    - Noise removal is conditioned on the input prompt
    - The noise removal process usually takes tens to thousands of steps
      \(\rightarrow\) Latency drawback.
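
To make the recap concrete, here is a minimal sketch of the prompt-conditioned denoising loop. `encode` and `denoiser` are hypothetical stand-ins for a real text encoder and noise-prediction network, and the update rule is heavily simplified compared to real samplers.

```python
import torch

def generate_image(prompt, encode, denoiser, steps=50):
    cond = encode(prompt)              # embed "A cat is sitting on a laptop"
    x = torch.randn(3, 64, 64)         # Step 1) start from pure random noise
    for t in reversed(range(steps)):   # Step 2) remove some noise at each step
        predicted_noise = denoiser(x, t, cond)  # conditioned on the prompt
        x = x - predicted_noise / steps         # heavily simplified update rule
    return x                           # the denoised image
```

The loop is exactly where the latency drawback comes from: generation costs `steps` forward passes of the denoiser instead of one.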
3. CodeFusion Architecture

[Key Idea] Allow the model to reconsider its solution at each denoising step
\(\rightarrow\) Mitigating the limitation explained at the beginning of the post
( = Code LLMs cannot easily reconsider tokens that were already generated )
Step 1) Encoding
( = The prompt is passed through a (transformer-based) encoder )
- Maps text tokens to vector representations (embeddings), as sketched below
- Encoder = pre-trained encoder from the CodeT5 model
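
As a concrete illustration of Step 1, the snippet below loads CodeT5's encoder from Hugging Face and embeds a prompt. The `Salesforce/codet5-base` checkpoint is an assumption for illustration; the paper's exact checkpoint may differ.

```python
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
encoder = T5EncoderModel.from_pretrained("Salesforce/codet5-base")

prompt = "sort a list of integers in ascending order"
inputs = tokenizer(prompt, return_tensors="pt")
E = encoder(**inputs).last_hidden_state  # prompt embedding E: (1, seq_len, d_model)
```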
 
Step 2) Denoising
( = The embeddings are passed to (transformer-based) denoiser )
- Input of denoiser = (1) embeddings + (2) random noise in a latent space
 - Multiple iterations of gradually removing the noise ( conditioned on the embeddings )
    
- In the latent space ( not in the data space )
 
 - Ends with \(x_0\) ( = Representation of the final denoiser in the latent space )
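
A minimal sketch of the Step 2 loop in the latent space, assuming a hypothetical transformer `denoiser` that predicts the clean latent; the step count, sizes, and the linear noise schedule are simplified placeholders, not the paper's exact formulation.

```python
import torch

def denoise(E, denoiser, steps=1200, seq_len=128, dim=512):
    x = torch.randn(1, seq_len, dim)  # random noise in the latent space
    x0_hat = x
    for t in reversed(range(steps)):
        x0_hat = denoiser(x, t, E)    # predict the clean latent, conditioned on E
        alpha = t / steps             # simplified linear noise schedule
        x = (1 - alpha) * x0_hat + alpha * torch.randn_like(x)  # re-noise to level t
    return x0_hat                     # final x_0, handed to the decoder
```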
 
Step 3) Decoding
( = Into discrete code tokens )
- ( Before projection to the classification head ) \(x_0\) is passed to the decoder together with the **prompt embedding \(\mathbf{E}\)**

Step 4) Classification
- A classification head projects the decoder's output onto the code vocabulary, producing the discrete code tokens ( sketched below, together with Step 3 )
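
Steps 3 and 4 can be sketched as follows, with hypothetical module names and assumed sizes; the decoder attends jointly over \(x_0\) and the prompt embedding \(\mathbf{E}\) (full self-attention, so a non-causal transformer stack is used here), and a classification head projects each position onto the code vocabulary.

```python
import torch
import torch.nn as nn

dim, vocab_size = 512, 32100  # assumed sizes, for illustration only
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=6,
)
head = nn.Linear(dim, vocab_size)  # classification head over code tokens

def decode(x0, E):
    h = decoder(torch.cat([E, x0], dim=1))  # Step 3) full attention over E and x_0
    h = h[:, E.shape[1]:]                   # keep only the code-token positions
    logits = head(h)                        # Step 4) per-token vocabulary logits
    return logits.argmax(dim=-1)            # discrete code tokens
```
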
4. Training Process
(1) Phase 1: Unsupervised pretraining
- Dataset = code snippets only ( without prompts )
- Train only the denoiser and decoder
  ( The missing prompt embedding is replaced with random noise; see the combined sketch after Phase 2 )
(2) Phase 2: Supervised fine-tuning
- Dataset = pairs of prompts and code snippets
- All components are fine-tuned, including the encoder
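
The sketch below combines both phases into one training step, under stated assumptions: `encoder`, `denoiser`, `decoder`, `head`, and `embed_code` are hypothetical stand-ins for the paper's components, `q_sample` is a simplified forward-noising rule, and the objective is a simplified mix of latent reconstruction and token cross-entropy rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

STEPS = 1200  # assumed number of diffusion steps

def q_sample(x0, t):
    """Simplified forward noising: blend the clean latent with Gaussian noise."""
    alpha = (t.float() / STEPS).view(-1, 1, 1)
    return (1 - alpha) * x0 + alpha * torch.randn_like(x0)

def train_step(batch, encoder, denoiser, decoder, head, embed_code, phase):
    x0 = embed_code(batch["code_ids"])       # clean code latents (B, L, D)
    if phase == 1:
        # Phase 1: no prompts, so the conditioning E is random noise.
        E = torch.randn(x0.shape[0], 32, x0.shape[-1])
    else:
        # Phase 2: condition on the real prompt embedding.
        E = encoder(batch["prompt"])
    t = torch.randint(0, STEPS, (x0.shape[0],))
    x0_hat = denoiser(q_sample(x0, t), t, E)  # predict the clean latent
    logits = head(decoder(x0_hat, E))         # code-token logits (B, L, V)
    return F.mse_loss(x0_hat, x0) + F.cross_entropy(
        logits.transpose(1, 2), batch["code_ids"]  # simplified combined objective
    )

# Phase 1: step an optimizer over denoiser/decoder/head parameters only.
# Phase 2: step an optimizer over all parameters, including the encoder.
```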
 
5. Experiments
