Large Language Models: A Survey (Part 3)

https://arxiv.org/pdf/2402.06196


3. How LLMs Are Built

P1) Introduction

Popular architectures used for LLMs

Data and modeling techniques

  • Data preparation
  • Tokenization
  • Pre-training
  • Instruction tuning
  • Alignment


P2) Major steps in training an LLM

  • Step 1) Data preparation (collection, cleaning, deduping, etc.)
  • Step 2) Tokenization
  • Step 3) Model pretraining (in a SSL fashion)
  • Step 4) Instruction tuning
  • Step 5) Alignment

(Figure) Major steps in LLM training


(1) Dominant LLM Architecture

P1) Three types

Most of them are based on the Transformer

  • Encoder-only
  • Decoder-only
  • Encoder-decoder


P2) Arch: Transformer

  • Pass!


P2-1)

  • Pass


P3) Arch: Encoder-Only

  • Model: Attention layers can access all the words in the initial sentence
  • Pretraining: MLM
  • Experiments: Understanding of the full sequence
    • e.g., Sentence classification, named entity recognition, and extractive question answering.
  • Ex) BERT


P4) Arch: Decoder-Only

  • Model: Attention layers can only access the words positioned before a given word in the sentence

    ( = Autoregressive model )

  • Pretraining: NTP

  • Experiments: Text generation

  • Ex) GPT


P5) Arch: Encoder-Decoder

(Also called sequence-to-sequence models)

  • Model:

    • Encoder: Can access all the words in the initial sentence
    • Decoder: Can only access the words positioned before a given word in the input
  • Pretraining: Uses the objectives of encoder or decoder models

    • e.g., Replacing random spans of text with a single mask special word

      $\rightarrow$ Predict the text that this mask word replaces

  • Experiments: Generating new sentences conditioned on a given input

    • e.g., Summarization, translation, or generative question answering


(2) Data Cleaning

P1) Data quality is crucial!

Data cleaning techniques

  • e.g., Filtering, deduplication

$\rightarrow$ Big impact on the model performance


P2) Falcon40B

  • Properly filtered and deduplicated web data $\rightarrow$ Leads to powerful models!

    ( + Despite extensive filtering, they were able to obtain five trillion tokens from CommonCrawl )


P3) Data Filtering

Enhance the quality of training data & effectiveness of the trained LLMs

Common data filtering techniques include:

  • (1) Removing noise
  • (2) Handling Outliers
  • (3) Addressing Imbalances
  • (4) Text Preprocessing
  • (5) Dealing with Ambiguities


P3-1) Removing Noise

Ex) Removing false information from the training data

Two mainstream approaches:

  • (1) Classifier-based
  • (2) Heuristic-based


P3-2) Handling Outliers

Prevent them from disproportionately influencing the model.


P3-3) Addressing Imbalances

To avoid biases and ensure fair representation


P3-4) Text Preprocessing

Cleaning and standardizing text data

  • By removing stop words, punctuation, …


P3-5) Dealing with Ambiguities

Resolving or excluding ambiguous or contradictory data

$\rightarrow$ Help the model to provide more definite and reliable answers


P4) Deduplication

Duplicate data: Introduces biases & reduces the diversity

$\rightarrow$ Remove duplicate instances or repeated occurrences


P4-1) Importance of deduplication

Particularly important when dealing with large datasets!

$\rightarrow$ Duplicates can unintentionally inflate the importance of certain patterns or characteristics.


P4-2) De-duplication method

Vary based on the nature of the data

  • Ex) Comparing entire data points or specific features
  • Ex) (Document level) Overlap ratio of high-level features (e.g. n-grams overlap) between documents


(3) Tokenization

P1) What is Tokenization?

  • Converting a sequence of text into smaller parts ( = tokens )
  • Three popular tokenizers
    • (1) BytePairEncoding
    • (2) WordPieceEncoding
    • (3) SentencePieceEncoding


P2) BytePairEncoding

  • (Originally) Data compression algorithm

    $\rightarrow$ Uses frequent patterns at byte level to compress the data

  • Pros)

    • (1) Simple
    • (2) Keeps the vocabulary not very large
    • (3) At the same time, represents common words well (toy merge loop sketched below)
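
A toy sketch of the BPE merge loop (assuming whitespace pre-tokenization; real byte-level tokenizers such as GPT-2's operate on bytes and cache merge ranks):

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges from a list of words (toy illustration)."""
    # Start from character-level symbols; count word frequencies
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():  # apply the merge everywhere
            symbols, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    symbols.append(word[i] + word[i + 1])
                    i += 2
                else:
                    symbols.append(word[i])
                    i += 1
            new_vocab[tuple(symbols)] += freq
        vocab = new_vocab
    return merges

print(bpe_train(["low", "low", "lower", "lowest", "newest"], num_merges=4))
```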


P3) WordPieceEncoding

  • Mainly used for very well-known models (e.g., BERT, Electra)
  • Similar to BPE but merges tokens based on likelihood (as in language modeling)


P4) SentencePieceEncoding

  • BPE & WPE: Assume that words are always separated by whitespace

    $\rightarrow$ Not always true!

  • SPE: Works on raw text including spaces


(4) Positional Encoding

P1) Absolute Positional Embeddings

  • Original Transformer model
  • Learned vs. Fixed
  • Main drawbacks
    • (1) Restriction to a certain number of tokens
    • (2) Fails to account for the relative distances between tokens
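
For reference, a NumPy sketch of the fixed (sinusoidal) variant from the original Transformer; the learned variant simply makes this table a trainable embedding matrix:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]      # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]   # (1, d_model/2), assumes even d_model
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe  # added to token embeddings; max_len is the hard token limit

print(sinusoidal_pe(max_len=512, d_model=64).shape)  # (512, 64)
```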


P2) Relative Positional Embeddings

  • To take into account the pairwise links between input tokens
  • RPE is added to the model at two levels
    • (1) Additional component to the keys
    • (2) Sub-component of the values matrix


P3) Rotary Position Embeddings (RoPE)

  • Rotation matrix to ..
    • (1) Encode the absolute position of words
    • (2) Include explicit relative position details in self-attention.
  • Flexibility with context lengths
  • e.g., GPT-NeoX-20B, PaLM, CODEGEN, and LLaMA
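
A minimal NumPy sketch of the rotation (assuming an even head dimension); note how the dot product between rotated vectors depends only on the relative offset:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by angle pos * base^(-2i/d)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # per-pair frequencies
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

q, k = np.random.randn(64), np.random.randn(64)
# Same relative distance (4) => same attention score, at any absolute position
print(rope(q, 3) @ rope(k, 7))
print(rope(q, 103) @ rope(k, 107))  # ~ identical
```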


(Figure) Positional encoding variants


(5) Model Pre-training

P1) Pretraining

Pretrained on a massive amount of (usually) unlabeled text in a self-supervised learning (SSL) manner

  • Next sentence prediction (NSP)
  • Next token prediction (NTP)
  • Masked language modeling (MLM)


P2) Next token prediction (NTP)

$\mathscr{L}_{ALM}(x)=\sum_{i=1}^N p\left(x_{i+n} \mid x_i, \ldots, x_{i+n-1}\right)$.
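
In practice this is optimized as a cross-entropy loss on shifted targets; a minimal PyTorch sketch, assuming the model has already produced logits:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, tokens):
    """Cross-entropy between the prediction at position i and the token at i+1."""
    # logits: (batch, seq_len, vocab); tokens: (batch, seq_len)
    pred = logits[:, :-1, :]   # predictions for positions 1..N-1
    target = tokens[:, 1:]     # the actual next tokens
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))

loss = next_token_loss(torch.randn(2, 10, 100), torch.randint(0, 100, (2, 10)))
print(loss)
```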


P3) Masked language modeling (MLM)

$\mathscr{L}_{MLM}(x)=\sum_{i=1}^N p\left(\bar{x} \mid x \backslash \bar{x}\right)$.
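
A sketch of the corresponding masking step (the 15% rate and the -100 ignore-index are standard BERT/PyTorch conventions, assumed here):

```python
import torch

def mask_tokens(tokens, mask_id, mask_prob=0.15):
    """Replace a random subset of tokens with [MASK]; predict only those."""
    tokens = tokens.clone()
    masked = torch.rand(tokens.shape) < mask_prob
    # Loss is computed only at masked positions (-100 = ignored by cross_entropy)
    labels = torch.where(masked, tokens, torch.full_like(tokens, -100))
    tokens[masked] = mask_id  # (BERT also keeps/randomizes some masked tokens)
    return tokens, labels

inputs, labels = mask_tokens(torch.randint(0, 100, (2, 12)), mask_id=103)
print(inputs, labels, sep="\n")
```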


P4) Mixture of Experts (MoE)

  • Enable models to be pre-trained with much less compute

    $\rightarrow$ $\therefore$ Can dramatically scale up the model or dataset size with the same compute budget

  • Two main elements:

    • (1) Sparse MoE layers
      • Used instead of FFN
      • Have a certain number of “experts” ( = each an NN; see the sketch below)
    • (2) Gate network or router
      • Determines which tokens are sent to which expert
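
A minimal top-1 routed MoE layer in PyTorch (illustrative; production implementations add load-balancing losses and expert-capacity limits):

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Sparse MoE: a gate network routes each token to one expert FFN."""
    def __init__(self, d_model, d_ff, num_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)  # the gate network

    def forward(self, x):                      # x: (num_tokens, d_model)
        gate = self.router(x).softmax(dim=-1)  # routing probabilities
        top_p, top_i = gate.max(dim=-1)        # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top_i == e                   # tokens routed to expert e
            if sel.any():
                out[sel] = top_p[sel].unsqueeze(-1) * expert(x[sel])
        return out

layer = MoELayer(d_model=16, d_ff=64, num_experts=4)
print(layer(torch.randn(8, 16)).shape)  # (8, 16)
```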


(Figure) Mixture of Experts (MoE) layer


(6) Fine-tuning and Instruction Tuning

P1) Necessity of fine-tuning

  • Fine-tuned to a specific task with labeled data (= SFT)

    • e.g., BERT: Finetuned to 11 different tasks
  • While more recent LLMs no longer require fine-tuning to be used…

    $\rightarrow$ they can still benefit from task or data-specific fine-tuning

    • e.g., (Much smaller) GPT-3.5 Turbo model + fine-tune > GPT-4


P2) Multi-task fine-tuning

  • Does not need to be limited to a single task!
  • Various approaches to multi-task fine-tuning
    • Improve results & Reduce the complexity of prompt engineering
    • Alternative to RAG
  • Ex) Fine-tune to expose the model to new data that it has not seen during pre-training


P3) Instruction Tuning

  • Instruction = Prompt that specifies the task (that the LLM should accomplish)
  • To align the responses to the expectations of humans!
    • Especially, when providing instructions through prompts!


P4) Importance of Instruction Tuning

  • Instruction datasets vary by LLM!
  • Instruction tuned models > Original foundation models
    • e.g., InstructGPT > GPT-3
    • e.g., Alpaca > LLaMA


P5) Self-Instruct

  • Popular approach in instruction tuning
  • Framework for improving the instruction-following capabilities of a pre-trained LM by bootstrapping its OWN generations
  • Procedure (see the sketch below)
    • Step 1) Generates (instruction, input, output) samples with the LM ( = itself )
    • Step 2) Filters invalid or similar ones
    • Step 3) Fine-tunes the original model with them
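
A high-level sketch of that loop; `generate`, `is_valid`, `too_similar`, and `finetune` are hypothetical stand-ins for the paper's actual components (e.g., it filters near-duplicates by ROUGE-L overlap):

```python
def self_instruct(model, seed_tasks, rounds, per_round,
                  generate, is_valid, too_similar, finetune):
    """Bootstrap an instruction dataset from the model's own generations."""
    pool = list(seed_tasks)  # (instruction, input, output) triples
    for _ in range(rounds):
        # Step 1) prompt the model with in-context examples drawn from the pool
        candidates = [generate(model, pool) for _ in range(per_round)]
        # Step 2) keep only valid, sufficiently novel generations
        pool += [c for c in candidates if is_valid(c) and not too_similar(c, pool)]
    # Step 3) fine-tune the original model on the bootstrapped pool
    return finetune(model, pool)
```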


(7) Alignment

P1) What is alignment?

  • Steering AI systems towards human goals, preferences, and principles

  • LLM = Often exhibit unintended behaviors :(

    ( e.g., toxic, harmful, misleading and biased )


P2) Alignment & Instruction tuning

  • Instruction tuning = Moves LLMs toward being aligned
  • Important to include further steps to **improve the alignment of the model** and avoid unintended behaviors!!

  • Most popular approaches
    • (1) RLHF, RLAIF
    • (2) DPO
    • (3) KTO


P3) RLHF & RLAIF

  • RLHF (reinforcement learning from human feedback)
    • Uses reward model to learn alignment from human feedback!
    • Procedure
      • (1) LLM generates multiple outputs
      • (2) Reward model rates multiple outputs & scores them (based on preferences given by humans)
      • (3) Reward model gives feedback to the LLM
      • (4) Feedback is used to tune the LLM
    • e.g., OpenAI-ChatGPT, Anthropic-Claude, Google-Gemini
  • RLAIF (Reinforcement learning from AI feedback)
    • Preference (Evaluation) by AI (=Model)


P4) Direct Preference Optimization (DPO)

No need for a reward model & PPO!

  • Limitation of RLHF: Complex and often unstable!

  • DPO = Stable, performant, and computationally lightweight

    $\rightarrow$ Eliminating the need for fitting a reward model, sampling from the LM during finetuning, or performing significant hyperparameter tuning!
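
The DPO objective in code, assuming per-sequence log-probabilities of the chosen/rejected responses under the policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * [(pi - ref)_chosen - (pi - ref)_rejected])."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Toy numbers: the policy prefers the chosen response more than the reference does
print(dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
               torch.tensor([-13.0]), torch.tensor([-14.0])))
```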


P5) Kahneman-Tversky Optimization (KTO)

  • Does not require paired preference data $(x, y_1, y_2)$
  • Only needs $(x,y)$ & knowledge of whether $y$ is desirable or undesirable
  • As good as or better than DPO-aligned models (at scales from 1B to 30B)
  • Far easier to use in the real world, as the kind of data it needs is far more abundant!
    • e.g., Purchase data = successful (purchase O) & unsuccessful (purchase X)

(Figure) Alignment methods


(8) Decoding Strategies

P1) Decoding

  • Decoding = Process of text generation using pretrained LLMs
  • Procedure
    • Step 1) LLM generates logits
    • Step 2) Logits are converted to probabilities using a softmax function
    • Step 3) Various decoding strategies
      • e.g., Greedy search, beam search, as well as sampling techniques such as top-k and top-p (nucleus sampling)


P2) Greedy Search

  • Pass


P3) Beam Search

  • Pass


P4) Top-k Sampling

  • Samples the next token only from the k most probable tokens
  • Temperature controls randomness (sketch below):
    • Low temperature = Priority to high-probability tokens ( = more deterministic)
    • High temperature = More creativity / randomness
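
A sketch of temperature + top-k sampling over a single logits vector (top-p would instead keep the smallest set of tokens whose cumulative probability exceeds p):

```python
import torch

def sample_top_k(logits, k=50, temperature=1.0):
    """Sample the next token id from the k most probable tokens."""
    logits = logits / temperature  # <1 sharpens, >1 flattens the distribution
    top_values, top_indices = torch.topk(logits, k)
    probs = torch.softmax(top_values, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return top_indices[choice].item()

logits = torch.randn(100)  # toy vocabulary of 100 tokens
print(sample_top_k(logits, k=10, temperature=0.7))
```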


(9) Cost-Effective Training/Inference/Adaptation/Compression

P1) Optimized Training

Various frameworks for optimized training of LLMs


P1-1) Zero Redundancy Optimizer (ZeRO)

Goal: To optimize memory

  • Vastly improves the training speed of LLMs, while increasing the model size that can be efficiently trained

  • Eliminates memory redundancies in data- and model-parallel training

  • Low communication volume and high computational granularity


P1-2) Receptance Weighted Key Value (RWKV)

Combines the ..

  • (1) Efficient parallelizable training of Transformers
  • (2) Efficient inference of RNNs

Leverages a linear attention mechanism & allows the model to be formulated as either a Transformer or an RNN

(Figure) RWKV architecture


P2) Low-Rank Adaption (LoRA)

(Figure) LoRA reparameterization

  • Can be applied to any subset of weight matrices in a NN

    • e.g., Four in the self-attention module $\left(W_q, W_k, W_v, W_o\right)$ & two in the MLP module
  • Mostly focused on adapting the “attention weights” only for downstream tasks

    ( Freezes the MLP modules, so they are not trained in downstream tasks both for simplicity and parameter-efficiency )
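
A minimal LoRA wrapper around a linear layer in PyTorch (frozen base weight $W$ plus a trainable low-rank update $BA$, scaled by alpha/r as in the paper):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B A x, with W frozen and only A, B trainable."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64), r=8)
print(layer(torch.randn(2, 64)).shape)  # (2, 64)
```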


P3) Knowledge Distillation

(Figure) Knowledge distillation


P3-1) Various types of distillation

  1. Response distillation
    • Follow the output of the teacher model!
    • Tries to teach the student model to perform similarly to the teacher (loss sketched below)
  2. Feature distillation
    • Follow the representation of the teacher model!
    • Not only the last layer, but also intermediate layers
  3. API distillation
    • Process of using an API (typically from an LLM provider such as OpenAI) to train smaller models
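
A sketch of the response-distillation loss (KL divergence between temperature-softened teacher and student distributions; the temperature trick is the standard Hinton-style choice, assumed here):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened output distributions."""
    student_logp = F.log_softmax(student_logits / T, dim=-1)
    teacher_p = F.softmax(teacher_logits / T, dim=-1)
    # T^2 keeps gradient magnitudes comparable across temperatures
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * T * T

print(distillation_loss(torch.randn(4, 100), torch.randn(4, 100)))
```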


P4) Quantization

  • Reducing the precision of the weights $\rightarrow$ Reduce the size of the model $\rightarrow$ Faster
  • e.g., FP32, FP16, NF4, …
  • Main approaches for model quantization
    • (1) Post training quantization
    • (2) Quantization-aware training
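
A toy post-training quantization sketch (symmetric per-tensor int8; real schemes typically calibrate scales per channel or group):

```python
import torch

def quantize_int8(w):
    """Map float weights to int8 with one scale per tensor."""
    scale = w.abs().max() / 127.0  # symmetric range [-127, 127]
    q = (w / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

w = torch.randn(4, 4)
q, scale = quantize_int8(w)
print((w - dequantize(q, scale)).abs().max())  # small reconstruction error
```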
