Code Llama Paper Explained
Roziere, Baptiste, et al. "Code llama: Open foundation models for code." arXiv preprint arXiv:2308.12950 (2023).
References:
- https://aipapersacademy.com/code-llama/
- https://arxiv.org/abs/2308.12950
Contents
- Background
- Pipeline
- Experiments
1. Background
Code Llama (by Meta AI)
= Family of open-source LLMs for code
Three types of models
- Foundation models: Code Llama
- Python specialization models: Code Llama – Python
- Instruction-following models: Code Llama – Instruct
- Each type: 7B, 13B and 34B params.
2. Pipeline
(Step 1) Pretrained model
- Starts from Llama 2 (trained on general-purpose text & code data)
- (\(\leftrightarrow\) StarCoder: trained on code only)
(Step 2) Code Training & Infilling Code Training
[1] Code Training
- Fine-tune on a code dataset of 500B tokens
- The dataset is mostly code, but also includes a small portion of natural language data
- Why natural language?
\(\rightarrow\) To keep the model's natural language understanding skills
[2] Infilling Code Training
( Only for the 7B and 13B versions of Code Llama and Code Llama – Instruct )
- LLM: Pretrained with Next Token Prediction
- Infilling: The model gets the surrounding context and predicts the missing middle part. \(\rightarrow\) How?
- Training documents are split into a prefix, a middle, and a suffix; the pieces are reordered with special tokens so that the middle comes last, and the model is then trained with the usual next-token prediction objective on the reordered sequence (sketched below)
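A minimal sketch of how such a fill-in-the-middle training example could be assembled. The sentinel token names and the random character-level split below are illustrative assumptions for the sketch; the key idea from the paper is that the reordered sequence can be trained with the regular next-token objective.

```python
import random

# Illustrative sentinel tokens (the paper adds dedicated special tokens for
# infilling; the exact names here are assumptions for this sketch).
PRE, SUF, MID, EOT = "<PRE>", "<SUF>", "<MID>", "<EOT>"

def make_infilling_example(document: str) -> str:
    """Split a document into prefix/middle/suffix and reorder it so that the
    middle is generated last, i.e. trainable with plain next-token prediction."""
    # Pick two random split points defining prefix | middle | suffix.
    i, j = sorted(random.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # Prefix-Suffix-Middle ordering: the model sees prefix and suffix as
    # context and learns to produce the missing middle, then an end marker.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}{EOT}"

print(make_infilling_example("def add(a, b):\n    return a + b\n"))
```

At inference time, the prompt stops after the middle sentinel, and whatever the model generates is the infilled code between the given prefix and suffix, which is what enables editor-style completion inside existing files.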
(Step 3) Python Code Training
( Only for the Code Llama – Python model )
- Continue training on another dataset of 100B tokens targeted at Python
(Step 4) Long Context Fine-tuning
( Llama 2: supports a context length of 4,096 tokens )
\(\rightarrow\) With such a context length, the model can only reason at the file level
But with long context fine-tuning, the context length is increased to 100k
\(\rightarrow\) Feed the model a full code repository and get repository-level reasoning
- Actually fine-tuned with 16k-token sequences, not 100k
( But it extrapolates well to sequences of up to 100k tokens )
- The main change is to the rotary position embeddings (RoPE): the base period of the rotation frequencies is increased, so positions stay distinguishable over much longer distances (sketched below)
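A minimal sketch of the mechanism behind this change, assuming the standard RoPE formulation: increasing the base period (from 10,000 to 1,000,000 per the paper) slows down the rotation frequencies, so position information remains distinguishable over much longer spans. The head dimension and position used below are only for illustration.

```python
import numpy as np

def rope_frequencies(head_dim: int, base: float) -> np.ndarray:
    """Per-dimension-pair rotation frequencies used by rotary position embeddings."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def rotation_angles(position: int, head_dim: int, base: float) -> np.ndarray:
    """Angles by which query/key dimension pairs are rotated at a given position."""
    return position * rope_frequencies(head_dim, base)

head_dim = 128
# Llama 2's default RoPE base vs. the much larger base used for long-context fine-tuning.
for base in (10_000.0, 1_000_000.0):
    angles = rotation_angles(position=16_000, head_dim=head_dim, base=base)
    # With the larger base, the slowest (lowest-frequency) dimensions have barely
    # rotated even at position 16k, leaving headroom for far longer contexts.
    print(f"base={base:>11,.0f}  slowest-dim angle at pos 16k: {angles[-1]:.4f} rad")
```

This also fits the extrapolation behavior reported above: with the larger base, positions well beyond the 16k fine-tuning length still map to rotation angles in a range the model has effectively seen.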
(Figure: perplexity vs. context length)
- X axis = context length
- Y axis = perplexity (PPL) of the models
- Dotted line = the context length used in fine-tuning, 16k
\(\rightarrow\) The perplexity keeps going down up to 100k tokens and only then starts to go up.
“Lost in the Middle: How Language Models Use Long Contexts” paper
- Abstract: it is harder for LLMs to reason over information in the middle of the context, compared to information at the beginning/end of the context
\(\rightarrow\) In Code Llama's case, only the 7B version shows a significant drop when the answer sits at the beginning of the context
(Step 5) Instruction Fine-tuning
( Rather than providing a code context to complete or infill … )
Provide the model with an instruction prompt, e.g., asking it to create a Bash command that satisfies a few conditions
Then, the model yields
- (1) the proper command
- (2) an explanation of each part of the command (an illustrative example follows)
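For illustration, such a request could be wrapped in the Llama 2 chat template that Code Llama – Instruct inherits; the question and the answer sketched in the comments are hypothetical examples, not outputs reported in the paper.

```python
# Hypothetical example of prompting an instruction-tuned code model.
# The [INST] ... [/INST] wrapper follows the Llama 2 chat convention that
# Code Llama - Instruct inherits; the question and the expected answer in the
# comments are illustrative assumptions.
question = (
    "In Bash, how do I list all text files in the current directory "
    "(excluding subdirectories) that have been modified in the last month?"
)
prompt = f"[INST] {question} [/INST]"

# An instruction-tuned model is expected to answer with the command plus an
# explanation of each part, e.g. something along the lines of:
#   find . -maxdepth 1 -type f -name "*.txt" -mtime -30
#     -maxdepth 1   : do not descend into subdirectories
#     -type f       : match regular files only
#     -name "*.txt" : match text files by extension
#     -mtime -30    : modified within the last 30 days
print(prompt)
```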
\(\rightarrow\) How does this work?
Instruction Fine-tuning with Self-Instruct
Three datasets
- (1) Same dataset used for instruction tuning of Llama 2
- Inherit Llama 2’s instruction-following and safety properties
- Does not contain many examples of code-related tasks
- (2) Dataset constructed using the self-instruct method (described below)
- (3) Rehearsal: a small proportion of the code and natural language training data, to prevent regression on code generation and language understanding
Self-instruct?
- Step 1) Prompt Llama 2 70B to write programming interview questions
- Step 2) Collect 62,000 interview-style programming questions
( \(\rightarrow\) After removing exact duplicates, we end up with 52,000 questions )
- Step 3) For each question, query Code Llama 7B twice:
- (1) Prompt it to generate unit tests for the question
- (2) Prompt it to generate 10 solutions for the question
- Step 4) Run the unit tests on the generated solutions
- Accept the first solution that passes the tests
- Add the (question, solution) pair to the self-instruct dataset (a sketch of this loop follows)
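A minimal sketch of this filter-by-tests loop, assuming hypothetical generate_unit_tests and generate_solutions helpers that wrap the Code Llama 7B prompts; it shows the selection logic only, not the paper's actual implementation.

```python
import subprocess
import sys
import tempfile
from typing import Callable, List, Optional

def first_passing_solution(
    question: str,
    generate_unit_tests: Callable[[str], str],            # hypothetical wrapper around Code Llama 7B
    generate_solutions: Callable[[str, int], List[str]],  # hypothetical wrapper around Code Llama 7B
    n_solutions: int = 10,
) -> Optional[str]:
    """Return the first generated solution that passes the generated unit tests."""
    tests = generate_unit_tests(question)
    for solution in generate_solutions(question, n_solutions):
        # Write the candidate solution together with its unit tests into one
        # file and execute it (a real pipeline would sandbox this properly).
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(solution + "\n\n" + tests)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=10)
        except subprocess.TimeoutExpired:
            continue  # a hanging solution counts as a failure
        if result.returncode == 0:  # all unit tests passed
            return solution
    return None  # no passing solution: the question is dropped

# Accepted (question, solution) pairs form the self-instruct dataset:
# dataset = [(q, sol) for q in questions
#            if (sol := first_passing_solution(q, gen_tests, gen_solutions)) is not None]
```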