Large Language Models: A Survey (Part 3)
https://arxiv.org/pdf/2402.06196
3. How LLMs Are Built
P1) Introduction
Popular architectures used for LLMs
Data and modeling techniques
- Data preparation
- Tokenization
- Pre-training
- Instruction tuning
- Alignment
P2) Major steps in training an LLM
- Step 1) Data preparation (collection, cleaning, deduping, etc.)
- Step 2) Tokenization
- Step 3) Model pretraining (in an SSL fashion)
- Step 4) Instruction tuning
- Step 5) Alignment
(1) Dominant LLM Architecture
P1) Three types
Most of them are based on Transformer
- Encoder-only
- Decoder-only
- Encoder-decoder
P2) Arch: Transformer
- Pass!
P2-1)
- Pass
P3) Arch: Encoder-Only
- Model: Attention layers can access all the words in the initial sentence
- Pretraining: MLM
- Best for: Tasks requiring understanding of the full sequence
- e.g., Sentence classification, named entity recognition, and extractive question answering.
- Ex) BERT
P4) Arch: Decoder-Only
- Model: Attention layers can only access the words positioned before a given word in the sentence (see the causal-mask sketch below)
( = Autoregressive model )
- Pretraining: NTP
- Best for: Text generation
- Ex) GPT
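Below is a minimal sketch (PyTorch assumed; not from the survey) of the causal mask that enforces this autoregressive constraint: position $i$ may only attend to positions $\le i$.

```python
# Minimal sketch of the causal attention mask used by decoder-only models.
# Future positions are set to -inf before the softmax, so each token can only
# attend to itself and earlier tokens.
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True above the diagonal = "do not attend" (future positions)
    return torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

scores = torch.randn(4, 4)                                # toy attention scores for 4 tokens
scores = scores.masked_fill(causal_mask(4), float("-inf"))
weights = torch.softmax(scores, dim=-1)                   # each row only covers past + current tokens
print(weights)
```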
P5) Arch: Encoder-Decoder
(Also called sequence-to-sequence models)
- Model:
- Encoder: Can access all the words in the initial sentence
- Decoder: Can only access the words positioned before a given word in the input
- Pretraining: Can use the objectives of encoder or decoder models
- e.g., Replace random spans of text with a single mask special token
$\rightarrow$ Predict the text that this mask token replaces (see the toy example below)
- Best for: Generating new sentences conditioned on a given input
- e.g., Summarization, translation, or generative question answering
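Toy example of the span-masking objective above; the `<mask_0>` sentinel naming is an illustrative (T5-style) assumption, not prescribed by the survey.

```python
# Toy illustration: replace a random span with a single mask token (encoder input),
# and ask the decoder to predict the replaced text.
import random

tokens = "the quick brown fox jumps over the lazy dog".split()
start = random.randint(0, len(tokens) - 3)
span = tokens[start:start + 3]                            # span to corrupt

encoder_input = tokens[:start] + ["<mask_0>"] + tokens[start + 3:]
decoder_target = ["<mask_0>"] + span                      # text that the mask token replaces

print("encoder input :", " ".join(encoder_input))
print("decoder target:", " ".join(decoder_target))
```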
(2) Data Cleaning
P1) Data quality is crucial!
Data cleaning techniques
- e.g., Filtering, deduplication
$\rightarrow$ Big impact on the model performance
P2) Falcon-40B
- Properly filtered and deduplicated web data $\rightarrow$ Leads to powerful models!
( + Despite extensive filtering, they were able to obtain five trillion tokens from CommonCrawl )
P3) Data Filtering
Enhance the quality of training data & effectiveness of the trained LLMs
Common data filtering techniques include:
- (1) Removing noise
- (2) Handling Outliers
- (3) Addressing Imbalances
- (4) Text Preprocessing
- (5) Dealing with Ambiguities
P3-1) Removing Noise
Ex) Removing false information from the training data
Two mainstream approaches:
- (1) Classifier-based
- (2) Heuristic-based
P3-2) Handling Outliers
Prevent them from disproportionately influencing the model.
P3-3) Addressing Imbalances
To avoid biases and ensure fair representation
P3-4) Text Preprocessing
Cleaning and standardizing text data
- By removing stop words, punctuation, …
P3-5) Dealing with Ambiguities
Resolving or excluding ambiguous or contradictory data
$\rightarrow$ Help the model to provide more definite and reliable answers
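A minimal sketch of the heuristic-based filtering idea from P3-1; all rules and thresholds here are illustrative assumptions, not those of any specific pipeline.

```python
# Sketch of heuristic-based document filtering; every threshold is illustrative.
import re

def keep_document(text: str) -> bool:
    words = text.split()
    if len(words) < 50:                              # too short to be informative
        return False
    if len(set(words)) / len(words) < 0.3:           # highly repetitive content
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_ratio < 0.6:                            # mostly symbols / markup debris
        return False
    if re.search(r"lorem ipsum", text, re.IGNORECASE):
        return False                                 # known placeholder noise
    return True

corpus = ["a long informative document ...", "buy now!!! $$$ click here $$$"]
cleaned = [doc for doc in corpus if keep_document(doc)]
```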
P4) Deduplication
Duplicate data: Introduces biases & Reduces data diversity
$\rightarrow$ Remove duplicate instances or repeated occurrences
P4-1) Importance of deduplication
Particularly important when dealing with large datasets!
( $\because$ Duplicates can unintentionally inflate the importance of certain patterns or characteristics )
P4-2) De-duplication method
Vary based on the nature of the data
- Ex) Comparing entire data points or specific features
- Ex) (Document level) Overlap ratio of high-level features (e.g. n-grams overlap) between documents
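A minimal sketch of document-level deduplication via n-gram overlap (Jaccard similarity); n = 3 and the 0.8 threshold are illustrative assumptions.

```python
# Sketch: flag near-duplicate documents by the overlap ratio of their n-gram sets.
def ngrams(text: str, n: int = 3) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(doc_a: str, doc_b: str, n: int = 3) -> float:
    a, b = ngrams(doc_a, n), ngrams(doc_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)                   # Jaccard similarity of n-gram sets

def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    return overlap_ratio(doc_a, doc_b) >= threshold
```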
(3) Tokenization
P1) What is Tokenization?
- Converting a sequence of text into smaller parts ( = tokens )
- Three popular tokenizers
- (1) BytePairEncoding
- (2) WordPieceEncoding
- (3) SentencePieceEncoding
P2) BytePairEncoding
- (Originally) Data compression algorithm
$\rightarrow$ Uses frequent patterns at the byte level to compress the data
- Pros)
- (1) Simple
- (2) Keeps the vocabulary not very large
- (3) At the same time, good enough to represent common words
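A toy sketch of the BPE merge loop (characters instead of bytes, tiny vocabulary); real tokenizers train on byte sequences over huge corpora.

```python
# Toy BPE training: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def most_frequent_pair(vocab: dict) -> tuple:
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab: dict, pair: tuple) -> dict:
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# word frequencies, with each word split into characters
vocab = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(3):                                   # a few merge steps
    vocab = merge_pair(vocab, most_frequent_pair(vocab))
print(vocab)                                         # frequent pairs like ("w","e") get merged first
```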
P3) WordPieceEncoding
- Mainly used for very well-known models (e.g., BERT, Electra)
- Similar to BPE but merges tokens based on likelihood (as in language modeling)
P4) SentencePieceEncoding
- BPE & WPE: Assume that words are always separated by white-space
$\rightarrow$ Not always true!
- SPE: Works directly on raw text, including the white-spaces
(4) Positional Encoding
P1) Absolute Positional Embeddings
- Original Transformer model
- Learned vs. Fixed
- Main drawbacks
- (1) Restriction to a certain number of tokens
- (2) Fails to account for the relative distances between tokens
P2) Relative Positional Embeddings
- To take into account the pairwise links between input tokens
- RPE is added to the model at two levels
- (1) Additional component to the keys
- (2) Sub-component of the values matrix
P3) Rotary Position Embeddings (RoPE)
- Uses a rotation matrix to ...
- (1) Encode the absolute position of words
- (2) Include explicit relative position details in self-attention.
- Flexibility with context lengths
- e.g., GPT-NeoX-20B, PaLM, CODEGEN, and LLaMA
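A minimal RoPE sketch (PyTorch assumed, GPT-NeoX-style pairing of the two halves of the head dimension); real implementations cache the cos/sin tables and apply this per attention head to queries and keys.

```python
# Minimal RoPE sketch: rotate pairs of feature dimensions by a position-dependent
# angle, so attention dot-products end up depending on relative positions.
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (seq_len, dim) with dim even
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs     # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation applied to each (x1, x2) pair of coordinates
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(8, 64)          # 8 positions, head dimension 64
q_rotated = rope(q)             # apply the same to keys before computing attention
```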
(5) Model Pre-training
P1) Pretraining
Pretrained on a massive amount of (usually) unlabeled text in an SSL manner
- Next sentence prediction (NSP)
- Next token prediction (NTP)
- Masked language modeling (MLM)
P2) Next token prediction (NTP)
$\mathscr{L}_{ALM}(x)=\sum_{i=1}^N p\left(x_{i+n} \mid x_i, \ldots, x_{i+n-1}\right)$.
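In practice this objective is implemented as a cross-entropy over shifted tokens (maximizing the log-probability of each next token); a minimal sketch, PyTorch assumed:

```python
# Sketch of the next-token-prediction (causal LM) loss: position i predicts token i+1.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 16
logits = torch.randn(1, seq_len, vocab_size)          # toy model outputs per position
tokens = torch.randint(0, vocab_size, (1, seq_len))   # toy input token ids

loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),           # drop the last position's logits
    tokens[:, 1:].reshape(-1),                        # drop the first token (nothing predicts it)
)
```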
P3) Masked language modeling (MLM)
$\mathscr{L}_{MLM}(x)=\sum_{i=1}^N p\left(\bar{x} \mid x \backslash \bar{x}\right)$.
P4) Mixture of Experts (MoE)
- Enables models to be pre-trained with much less compute
$\rightarrow$ $\therefore$ Can dramatically scale up the model or dataset size with the same compute budget
- Two main elements (see the routing sketch below):
- (1) Sparse MoE layers
- Used instead of dense FFN layers
- Have a certain number of "experts" ( = neural networks, in practice usually FFNs )
- (2) Gate network or router
- Determines which tokens are sent to which expert
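A sketch of a sparse MoE layer with a top-2 router (PyTorch assumed); the expert count, top-k value, and FFN shape are illustrative, not tied to any particular model.

```python
# Sketch of a sparse MoE layer: a router picks the top-k experts for each token,
# and the outputs are combined with the (renormalized) gate weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, dim: int = 512, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts)       # gate network
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        gate_logits = self.router(x)                      # (tokens, num_experts)
        weights, experts = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = experts[:, k] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out

moe = SparseMoE()
y = moe(torch.randn(10, 512))
```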
(6) Fine-tuning and Instruction Tuning
P1) Necessity of fine-tuning
- Fine-tuned to a specific task with labeled data ( = SFT )
- e.g., BERT: Fine-tuned for 11 different tasks
- While more recent LLMs no longer require fine-tuning to be used...
$\rightarrow$ They can still benefit from task- or data-specific fine-tuning
- e.g., A (much smaller) fine-tuned GPT-3.5 Turbo model can outperform GPT-4
P2) Multi-task fine-tuning
- Fine-tuning does not need to be limited to a single task!
- Various approaches to multi-task fine-tuning
- Improve results & Reduce the complexity of prompt engineering
- Alternative to RAG
- Ex) Fine-tune to expose the model to new data that it has not seen during pre-training
P3) Instruction Tuning
- Instruction = Prompt that specifies the task (that the LLM should accomplish)
- Goal: Align the model's responses to the expectations that humans have when providing instructions through prompts!
P4) Importance of Instruction Tuning
- Instruction datasets vary by LLM!
- Instruction tuned models > Original foundation models
- e.g., InstructGPT > GPT-3
- e.g., Alpaca > LLaMA
P5) Self-Instruct
- Popular approach in instruction tuning
- Framework for improving the instruction-following capabilities of a pre-trained LM by bootstrapping off its OWN generations
- Procedure
- Step 1) Generate (instruction, input, output) samples with the LM ( = itself )
- Step 2) Filter out invalid or similar ones
- Step 3) Fine-tune the original model on them
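For fine-tuning, each (instruction, input, output) sample (e.g., the ones bootstrapped by Self-Instruct) is typically flattened into a prompt template; the template wording below is an illustrative (Alpaca-style) assumption, not from the survey.

```python
# Sketch of formatting an (instruction, input, output) triple for supervised fine-tuning.
def format_example(instruction: str, inp: str, output: str) -> str:
    prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{inp}\n\n### Response:\n"
    return prompt + output

sample = format_example(
    "Summarize the text.",
    "LLMs are pre-trained on massive unlabeled corpora in a self-supervised fashion ...",
    "LLMs learn by self-supervised training on large text corpora.",
)
print(sample)
# During SFT, the loss is usually computed only on the response tokens,
# so the model learns to produce answers rather than to echo the prompt.
```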
(7) Alignment
P1) What is alignment?
- Steering AI systems towards human goals, preferences, and principles
- LLMs often exhibit unintended behaviors :(
( e.g., toxic, harmful, misleading, and biased outputs )
P2) Alignment & Instruction tuning
- Instruction tuning = Moves LLMs closer to being aligned
- Still important to include further steps to **improve the alignment of the model** and avoid unintended behaviors!!
- Most popular approaches
- (1) RLHF, RLAIF
- (2) DPO
- (3) KTO
P3) RLHF & RLAIF
- RLHF (reinforcement learning from human feedback)
- Uses a reward model to learn alignment from human feedback!
- Procedure
- (1) LLM generates multiple outputs
- (2) Reward model rates the outputs & scores them (based on preferences given by humans)
- (3) Reward model gives feedback to the LLM
- (4) Feedback is used to tune the LLM
- e.g., OpenAI-ChatGPT, Anthropic-Claude, Google-Gemini
- RLAIF (Reinforcement learning from AI feedback)
- Preferences (evaluations) are provided by an AI model instead of humans
P4) Direct Preference Optimization (DPO)
No need for a reward model & PPO!
- Limitation of RLHF: Complex and often unstable!
- DPO = Stable, performant, and computationally lightweight
$\rightarrow$ Eliminates the need for fitting a reward model, sampling from the LM during fine-tuning, or performing significant hyperparameter tuning!
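A minimal sketch of the DPO loss (PyTorch assumed); sequence-level log-probabilities of the chosen/rejected responses under the policy and the frozen reference model are assumed to be pre-computed.

```python
# Sketch of the DPO objective: push the policy to prefer the chosen response over the
# rejected one, measured relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # Implicit "reward" = beta * log-ratio between policy and reference model
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected responses
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# toy per-sequence log-probabilities for a batch of 4 preference pairs
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```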
P5) Kahneman-Tversky Optimization (KTO)
- Does not require paired preference data $(x, y_1, y_2)$
- Only needs $(x,y)$ & knowledge of whether $y$ is desirable or undesirable
- KTO-aligned models are as good as or better than DPO-aligned models (at scales from 1B to 30B)
- Far easier to use in the real world, as the kind of data it needs is far more abundant!
- e.g., Purchase data = successful (purchase O) & unsuccessful (purchase X)
(8) Decoding Strategies
P1) Decoding
- Decoding = Process of text generation using pretrained LLMs
- Procedure
- Step 1) LLM generates logits
- Step 2) Logits are converted to probabilities using a softmax function
- Step 3) Various decoding strategies
- e.g., Greedy search, beam search, as well as sampling techniques such as top-k and top-p (nucleus sampling)
P2) Greedy Search
- pass
P3) Beam Search
- pass
P4) Top-k Sampling
- Samples the next token from the k most probable candidates (instead of always taking the single most likely token)
- Low temperature = More deterministic output (prioritizes high-probability tokens)
- High temperature = More random / creative output
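A minimal sketch of top-k sampling with temperature (PyTorch assumed); k = 50 is an illustrative default.

```python
# Sketch: keep only the k most probable tokens, rescale by temperature, then sample.
import torch

def sample_top_k(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> int:
    logits = logits / temperature                    # <1 sharpens, >1 flattens the distribution
    top_vals, top_idx = logits.topk(k)
    probs = torch.softmax(top_vals, dim=-1)          # renormalize over the k candidates
    choice = torch.multinomial(probs, num_samples=1)
    return top_idx[choice].item()

next_token = sample_top_k(torch.randn(32_000), k=50, temperature=0.7)
```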
(9) Cost-Effective Training/Inference/Adaptation/Compression
P1) Optimized Training
Various frameworks for optimized training of LLMs
P1-1) Zero Redundancy Optimizer (ZeRO)
Goal: To optimize memory
- Vastly improves training speed of LLMs while increasing the model size that can be trained
- Eliminates memory redundancies in data- and model-parallel training
- Retains low communication volume and high computational granularity
P1-2) Receptance Weighted Key Value (RWKV)
Combines the ..
- (1) Efficient parallelizable training of Transformers
- (2) Efficient inference of RNNs
Leverages a linear attention mechanism $\rightarrow$ Allows the model to be formulated as either a Transformer or an RNN
P2) Low-Rank Adaptation (LoRA)
- Can be applied to any subset of weight matrices in a NN
- e.g., The four matrices in the self-attention module $\left(W_q, W_k, W_v, W_o\right)$ & two in the MLP module
- Most work focuses on adapting only the "attention weights" for downstream tasks
( Freezes the MLP modules, so they are not trained in downstream tasks, both for simplicity and parameter-efficiency )
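A minimal LoRA sketch (PyTorch assumed); the rank, scaling, and wrapped layer are illustrative choices.

```python
# LoRA sketch: freeze the pre-trained weight and learn a low-rank update B @ A on top.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():             # frozen pre-trained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))   # zero-init = no-op at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# e.g., wrap the query projection W_q of an attention layer
wq = LoRALinear(nn.Linear(512, 512))
out = wq(torch.randn(2, 512))
```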
P3) Knowledge Distillation
P3-1) Various types of distillation
- Response distillation
- Follow the output of the teacher model!
- Tries to teach the student model to perform similarly to the teacher (see the loss sketch after this list)
- Feature distillation
- Follow the representation of the teacher model!
- Not only the last layer, but also intermediate layers
- API distillation
- Process of using an API (typically from an LLM provider such as OpenAI) to train smaller models
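A minimal sketch of response (logit-level) distillation (PyTorch assumed); the temperature is an illustrative choice.

```python
# Sketch: the student matches the teacher's softened output distribution via KL divergence.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 as is conventional
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

loss = distillation_loss(torch.randn(8, 32_000), torch.randn(8, 32_000))
```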
P4) Quantization
- Reducing the precision of the weights $\rightarrow$ Reduce the size of the model $\rightarrow$ Faster
- e.g., FP32, FP16, INT8, …
- Main approaches for model quantization
- (1) Post-training quantization (see the toy sketch below)
- (2) Quantization-aware training
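A toy sketch of post-training quantization (symmetric per-tensor int8, PyTorch assumed); production schemes are typically per-channel and also handle activations and outliers.

```python
# Toy post-training quantization of a weight tensor to int8.
import torch

def quantize_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0                    # map the largest weight to +/-127
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4, 4)
q, scale = quantize_int8(w)
print((w - dequantize(q, scale)).abs().max())        # quantization error
```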