Large Language Models: A Survey (Part 2)

https://arxiv.org/pdf/2402.06196


2. LLM

(2) LLM Families

P1) LLM

  • Transformer-based PLM ( ~ tens to hundreds of billions of parameters )
  • Stronger language understanding and generation (vs. PLM)
  • Emergent abilities that are not present in smaller-scale models
  • Three LLM families
    • (1) GPT
    • (2) LLaMA
    • (3) PaLM

figure2


P2) The GPT Family

Definition = Family of decoder-only Transformer-based language models ( by OpenAI )

  • GPT-1, GPT-2, GPT-3, InstructGPT, ChatGPT, GPT-4, CODEX, and WebGPT…
  • [1] Early models
    • GPT-1 and GPT-2
    • Open Source
  • [2] Recent models
    • GPT-3 and GPT-4
    • Closed source ( can only be accessed via APIs )


P2-1) GPT3

  • 175 B params
  • Widely considered the first LLM, in that it is ..
    • (1) Much larger than previous PLMs
    • (2) The first to demonstrate emergent abilities not observed in smaller PLMs
  • Shows the emergent ability of “in-context learning”
    • (1) Can be applied to any downstream task without any gradient updates or fine-tuning
    • (2) Tasks and few-shot demonstrations are specified purely via text interaction with the model
  • Strong performance on many NLP tasks
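As a minimal sketch of in-context learning, the snippet below assembles a few-shot prompt as plain text. The function name and the `Input:` / `Output:` layout are illustrative assumptions, not a format prescribed by GPT-3; the point is only that the demonstrations live in the prompt and no weights are updated.

```python
def build_few_shot_prompt(task_description, demonstrations, query):
    """Assemble an in-context learning prompt as plain text.

    The demonstrations only appear in the prompt; the model itself
    receives no gradient updates.
    """
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final query is left open so the model completes the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


demos = [("cheese", "fromage"), ("house", "maison")]
prompt = build_few_shot_prompt("Translate English to French.", demos, "cat")
print(prompt)
```

Passing an empty `demonstrations` list gives the zero-shot variant of the same prompt.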


figure2

  • In-context learning performance as a function of the number of examples in the prompt


P2-2) CODEX

  • Descendant of GPT-3 (introduced by OpenAI in 2021)

  • General-purpose “programming” model

    • Parse natural language & Generate code
  • Fine-tuned for programming applications on code corpora collected from GitHub

    \(\rightarrow\) CODEX powers Microsoft’s GitHub Copilot


P2-3) WebGPT

  • Descendant of GPT-3

  • Fine-tuned to answer open-ended questions using a text-based web browser

    \(\rightarrow\) Helps users search and navigate the web.

  • Trained in 3 steps

    • Step 1) Learn to mimic human browsing behaviors
      • Using human demonstration data.
    • Step 2) Reward function
      • Learned to predict human preferences
    • Step 3) Reinforcement learning & Rejection sampling
      • Refined to optimize the reward function
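Steps 2 and 3 can be illustrated with rejection sampling (best-of-n): sample several candidate answers, score each with the reward function, and keep the highest-scoring one. In this sketch `toy_reward` is a hypothetical stand-in for a learned reward model (it just counts words shared with the question), and the candidate list stands in for samples drawn from the policy.

```python
def toy_reward(question, answer):
    # Hypothetical stand-in for a learned reward model:
    # counts how many distinct question words also occur in the answer.
    question_words = set(question.lower().split())
    answer_words = set(answer.lower().split())
    return len(question_words & answer_words)


def best_of_n(question, candidates, reward_fn):
    """Rejection sampling: return the candidate with the highest reward."""
    return max(candidates, key=lambda answer: reward_fn(question, answer))


question = "what is the capital of france"
candidates = [
    "i do not know",
    "paris is the capital of france",
    "bananas are yellow",
]
best = best_of_n(question, candidates, toy_reward)
print(best)
```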


P2-4) InstructGPT

  • Goal: To follow expected “human instructions”
  • Align LLM with user intent on a wide range of tasks by fine-tuning with human feedback
  • Reinforcement Learning from Human Feedback (RLHF)
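The notes do not spell out how the reward model in RLHF is trained. A common choice, sketched here as an assumption rather than as the exact InstructGPT recipe, is a pairwise preference loss: given scalar rewards for a human-preferred response and a rejected response, minimize `-log(sigmoid(r_chosen - r_rejected))`.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(reward_chosen - reward_rejected)), computed stably."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    # For negative diff, rewrite to avoid overflow in exp().
    return -diff + math.log1p(math.exp(diff))


# The loss is small when the preferred response already scores higher,
# and large when the reward model ranks the pair the wrong way round.
loss_correct_order = pairwise_preference_loss(2.0, 0.0)
loss_wrong_order = pairwise_preference_loss(0.0, 2.0)
print(loss_correct_order, loss_wrong_order)
```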

figure2


P2-5) ChatGPT

  • Launched on November 30, 2022; widely regarded as the most important milestone of LLM development
  • Chatbot: Wide range of tasks (such as question answering, information seeking, text summarization…)
  • Powered by GPT-3.5 (and later by GPT-4) (= a sibling model to InstructGPT)


P2-6) GPT-4

  • Multimodal LLM (Launched in March, 2023)
    • Input: Image and text
    • Output: Text
  • Exhibits human-level performance on various professional and academic benchmarks
  • Training (same as early GPT models)
    • Step 1) Pre-trained to predict next tokens
    • Step 2) Fine-tuned with RLHF to align model behaviors with human-desired ones

figure2


P3) The LLaMA Family (by Meta)

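LLaMA models are “open-source” ( model weights are released to the research community under a noncommercial license )

\(\rightarrow\) The family grows rapidly, as many groups build on these models!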


P3-1) LLaMA

( LLaMA: Open and Efficient Foundation Language Models )

https://arxiv.org/pdf/2302.13971

  • [Date] February 2023
  • [Size] Ranging from 7B to 65B params
  • [Data] Pre-trained on trillions of tokens ( from publicly available datasets )
  • [Arch] Transformer architecture of GPT-3 + \(\alpha\)
  • Difference?
    • (1) ReLU \(\rightarrow\) SwiGLU activation
    • (2) Absolute positional embeddings \(\rightarrow\) Rotary positional embeddings (RoPE)
    • (3) Standard LayerNorm \(\rightarrow\) Root-mean-square layer normalization (RMSNorm)
  • Result:
    • (open-source) LLaMA-13B > (proprietary) GPT-3 (175B) on most benchmarks
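Two of the architectural changes listed above, RMSNorm and the SwiGLU feed-forward block, can be sketched in plain Python. This is a toy version on lists rather than tensors; the weight shapes, the `eps` value, and the function names are assumptions made for illustration and are not taken from the LLaMA code base.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Scale x by its root-mean-square (no mean subtraction, unlike LayerNorm)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def silu(z):
    """SiLU / Swish activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(matrix, x):
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


x = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
w_gate = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]]   # hidden_dim x model_dim
w_up = [[1.0, 1.0], [0.5, 0.5], [-1.0, 1.0]]     # hidden_dim x model_dim
w_down = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]     # model_dim x hidden_dim
y = swiglu_ffn(x, w_gate, w_up, w_down)
print(x, y)
```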

figure2

figure2


Rotary positional embedding

  • https://www.youtube.com/watch?v=o29P0Kpobz0

figure2

figure2
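A minimal sketch of rotary positional embeddings, assuming the common formulation in which each adjacent pair of dimensions is rotated by an angle proportional to the token position (with base 10000). The useful property is that the dot product between a rotated query and a rotated key depends only on their relative offset, which the last lines check numerically.

```python
import math


def rope(x, position, base=10000.0):
    """Rotate each adjacent pair (x[i], x[i+1]) by position * base**(-i/d)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative offset (2) at different absolute positions.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
print(score_near, score_far)
```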


P3-2) LLaMA-2

( LLaMA 2: Open Foundation and Fine-Tuned Chat Models )

https://arxiv.org/pdf/2307.09288

  • by Meta (+ partner Microsoft )

  • Include both

    • (1) Foundation language models
    • (2) Chat models (finetuned for dialog) \(\rightarrow\) LLaMA-2 Chat


figure2

  • Step 1) Pretrain LLaMA-2 with publicly available online data
  • Step 2) Supervised fine-tuning (SFT)
  • Step 3) Iteratively refined using RLHF, specifically through…
    • (1) Rejection sampling
    • (2) Proximal policy optimization (PPO)


P3-3) Alpaca

  • Fine-tuned from the LLaMA-7B model on 52K instruction-following demonstrations
    • Generated in the style of self-instruct using GPT-3.5 (text-davinci-003)
  • Very cost-effective for training (especially for academic research)
  • (Self-instruct evaluation set) Alpaca performs similarly to GPT-3.5, even though Alpaca is much smaller


P3-4) Vicuna

  • Vicuna-13B: 13B chat model
    • By fine-tuning LLaMA on user-shared conversations collected from ShareGPT
  • Evaluation
    • (Evaluator = GPT-4) Vicuna-13B achieves more than 90% quality of OpenAI’s ChatGPT and Google’s Bard
  • Training cost of Vicuna-13B is merely $300!


P3-5) Guanaco

  • Also finetuned LLaMA models using instruction-following data
  • Efficient fine-tuning
    • Uses QLoRA
    • A 65B-parameter model can be fine-tuned on a single 48GB GPU
    • QLoRA: Back-propagates gradients through a frozen, 4-bit quantized PLM into Low-Rank Adapters (LoRA)
  • Result: Best Guanaco model
    • Outperforms all previously released models on the Vicuna benchmark
      • 99.3% of the performance level of ChatGPT ( with a much lighter model )
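The QLoRA idea can be sketched as a forward pass: the base weights are stored in 4 bits and stay frozen, while a small low-rank update `B (A x)` is added on top and is the only part that would receive gradients. This toy uses uniform absmax quantization as a stand-in for the quantization data type used in the actual method, and the `alpha` scaling value and all names are assumptions for illustration.

```python
def quantize_4bit(row):
    """Uniform absmax quantization of one weight row to integer codes in [-8, 7]."""
    scale = (max(abs(w) for w in row) / 7.0) or 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in row]
    return codes, scale


def matvec(matrix, x):
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def qlora_forward(x, quantized_rows, lora_a, lora_b, alpha=16.0):
    """y = dequant(W) x + (alpha / r) * B (A x); only A and B would be trained."""
    rank = len(lora_a)
    base = [scale * sum(c * v for c, v in zip(codes, x))
            for codes, scale in quantized_rows]
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


weights = [[0.7, -0.2, 0.1], [0.0, 0.3, -0.7]]   # frozen base layer (2 x 3)
quantized = [quantize_4bit(row) for row in weights]
lora_a = [[0.01, 0.02, -0.01]]                    # A: rank x d_in (rank 1)
lora_b = [[0.0], [0.0]]                           # B: d_out x rank, starts at zero
x = [1.0, 2.0, 3.0]
y = qlora_forward(x, quantized, lora_a, lora_b)
print(y)
```

Because `lora_b` starts at zero, the adapted layer initially reproduces the quantized base layer; training would then move only `lora_a` and `lora_b`.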


P3-6) Koala

  • Another instruction-following language model built on LLaMA
    • Specific focus on interaction data that include user inputs and responses generated by highly capable closed-source chat models (e.g., ChatGPT)


P3-7) Mistral-7B

  • Mistral-7B: Engineered for superior performance & efficiency
  • (1) Grouped-query attention \(\rightarrow\) For faster inference
  • (2) Sliding window attention \(\rightarrow\) To effectively handle sequences of arbitrary length with a reduced inference cost
  • Outperforms the best open-source models of larger size
    • 13B model (LLaMA-2-13B): across all evaluated benchmarks
    • 34B model (LLaMA-34B): in reasoning, mathematics, and code generation
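Sliding window attention can be pictured as a causal mask in which each token only sees the most recent `window` positions (itself included). The sketch below builds such a mask; the exact boundary convention is an assumption and may differ from the Mistral implementation.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=5, window=3)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Each row has at most `window` allowed positions, so the per-token attention cost stays bounded as the sequence grows.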


P3-8) Summary of LLaMA

  • Many other instruction-following models have been built on LLaMA or LLaMA-2, including Code LLaMA [66], Gorilla [67], Giraffe [68], Vigogne [69], Tulu 65B [70], Long LLaMA [71], and Stable Beluga2 [72], to name just a few.


P4) The PaLM Family

PaLM (Pathways Language Model) … by Google

First PaLM model: April 2022

  • [Size] 540B params
  • [Dataset] High-quality text corpus consisting of 780B tokens
  • [Hardware] 6144 TPU v4 chips using the Pathways system
    • Enables highly efficient training across multiple TPU Pods


P4-1) U-PaLM

  • U-PaLM models of 8B, 62B, 540B

    = PaLM models further trained with UL2R

    • UL2R: A method of continuing the training of LLMs for a few more steps with UL2’s mixture-of-denoisers objective
    • Approximately 2x computational savings


P4-2) Flan-PaLM

  • Flan-PaLM = U-PaLM + Instruction finetuning
    • Finetuning: Uses a much larger number of tasks, larger model sizes, and chain-of-thought data
  • Result: Substantially outperforms previous instruction-following models
  • Fine-tuning data: comprises 473 datasets, 146 task categories, and 1,836 total tasks

figure2


P4-3) PaLM-2

  • More compute-efficient LLM
  • With better multilingual and reasoning capabilities


P4-4) Med-PaLM

  • Domain-specific PaLM
    • Finetuned on PaLM using instruction prompt tuning
  • Designed to provide high-quality answers to medical questions


(4) Other Representative LLMs

P1) FLAN

  • Simple method for improving the zero-shot learning abilities of language models

    \(\rightarrow\) Showed that instruction tuning LMs on a collection of datasets substantially improves zero-shot performance

  • Instruction-tuned model, “FLAN”

    • Start from a pretrained LM with 137B params
    • Instruction-tune it on over 60 NLP datasets
      • Each dataset is verbalized via natural language instruction templates
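Verbalizing a dataset with instruction templates can be sketched as follows: one labeled example is rendered through several differently worded templates, each producing a (prompt, target) pair for instruction tuning. The templates and field names here are made up for illustration and are not the ones used by FLAN.

```python
# Hypothetical instruction templates for a natural language inference example.
TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
    "Read the text: {premise}\nCan we conclude that \"{hypothesis}\"? Answer yes or no.",
    "{premise}\nBased on the sentence above, is it true that {hypothesis}",
]


def verbalize(example, templates):
    """Render one labeled example into (prompt, target) pairs, one per template."""
    return [(template.format(**example), example["label"]) for template in templates]


example = {
    "premise": "A dog is running in the park.",
    "hypothesis": "An animal is outside.",
    "label": "yes",
}
pairs = verbalize(example, TEMPLATES)
for prompt, target in pairs:
    print(prompt, "->", target)
    print()
```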

figure2


P2) Gopher

  • Analysis of Transformer-based LMs across a wide range of model scales
  • Gopher = 280B params model
  • Evaluated on 152 diverse tasks

figure2


P3) T0

  • Developed a system for easily mapping any natural language task into a human-readable prompted form

  • Convert (a) \(\rightarrow\) (b)

    • (a) Supervised datasets
    • (b) Multiple prompts with diverse wording

    \(\rightarrow\) These prompted datasets allow for benchmarking the ability of a model to perform completely held-out tasks


P4) ERNIE 3.0

  • Unified framework for pre-training large-scale knowledge enhanced models
    • Fuses an auto-regressive (AR) network and an auto-encoding (AE) network
  • Tailored for both natural language (1) understanding & (2) generation tasks
  • ERNIE 3.0 = 10B params, trained on a 4TB corpus

figure2


P5) RETRO (Retrieval Enhanced Transformer)

  • Enhanced AR model

    • By conditioning on document chunks retrieved from a large corpus

      ( based on local similarity with preceding tokens )

  • Frozen BERT retriever & Differentiable encoder & Chunked cross-attention mechanism

    \(\rightarrow\) Predicts tokens based on an order of magnitude more data than what is typically consumed during training.
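The retrieval step can be sketched as chunk-level nearest-neighbor search: the corpus is split into fixed-size chunks, each chunk is embedded, and the chunks most similar to the preceding tokens are returned for the model to condition on. Here a bag-of-words count vector is a deliberately crude stand-in for the frozen BERT embeddings, and the chunk size and function names are assumptions.

```python
import math
from collections import Counter


def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def embed(tokens):
    # Stand-in for a frozen encoder: a sparse bag-of-words count vector.
    return Counter(tokens)


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_tokens, corpus_chunks, k=1):
    """Return the k corpus chunks most similar to the query tokens."""
    query = embed(query_tokens)
    ranked = sorted(corpus_chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:k]


corpus = "the eiffel tower is in paris france . the great wall is in china asia .".split()
chunks = chunk(corpus, size=8)
neighbors = retrieve("where is the eiffel tower".split(), chunks, k=1)
print(" ".join(neighbors[0]))
```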

figure2


P6) GLaM (Generalist Language Model)

  • Sparsely activated MoE
    • Scales model capacity while incurring substantially less training cost than dense variants
  • Largest GLaM = 1.2T parameters ( = approximately 7x larger than GPT-3 )
    • Uses 1/3 of the energy used to train GPT-3
    • Requires 1/2 of the computation FLOPs for inference
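Sparse activation is why a model with far more parameters can need fewer FLOPs per token: a router picks a few experts for each token and only those are evaluated. The sketch below shows top-2 routing over four toy experts; the routing rule, the expert functions, and all names are illustrative assumptions rather than the GLaM implementation.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Pick the k highest-probability experts and renormalize their weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    norm = sum(probs[e] for e in chosen)
    return [(e, probs[e] / norm) for e in chosen]


def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    output = [0.0] * len(x)
    for expert_id, weight in top_k_route(router_logits, k):
        expert_out = experts[expert_id](x)
        output = [o + weight * v for o, v in zip(output, expert_out)]
    return output


calls = []  # records which toy experts actually ran


def make_expert(scale):
    def expert(x):
        calls.append(scale)
        return [scale * v for v in x]
    return expert


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
y = moe_layer([1.0, 2.0], [0.1, 2.0, -1.0, 2.0], experts)
print(y, sorted(calls))
```

Only two of the four experts run for this token, even though all four contribute to the parameter count.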

figure2


P7) LaMDA

  • Transformer-based models specialized for dialog

    • Up to 137B params & pre-trained on 1.56T words of public dialog data and web text
  • Findings: Fine-tuning with annotated data & enabling the model to consult external knowledge sources

    \(\rightarrow\) Leads to significant improvements on the two key challenges of safety and factual grounding


P8) OPT (Open Pre-trained Transformers)

  • Decoder-only pre-trained transformers
    • params: 125M ~ 175B

figure2


P9) Chinchilla

  • Investigated the optimal model size and number of tokens under a given compute budget
  • Experimental settings
    • Over 400 language models (70M~16B params, trained on 5B~500B tokens)
  • Findings: model size & number of training tokens should be scaled equally
  • Chinchilla = Compute-optimal model
    • Same compute budget as Gopher but with 70B parameters and 4x more data
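A rough arithmetic check of the "same compute budget" claim, using the common approximation that training cost is about 6 x parameters x tokens FLOPs. The approximation and the token counts (about 300B tokens for Gopher and 1.4T for Chinchilla, as reported in their papers) are not stated in these notes and are brought in here as assumptions.

```python
import math


def training_flops(params, tokens):
    """Common rule of thumb: about 6 FLOPs per parameter per training token."""
    return 6.0 * params * tokens


def scale_equally(params, tokens, compute_multiplier):
    """Grow model size and token count by the same factor for a larger budget."""
    factor = math.sqrt(compute_multiplier)
    return params * factor, tokens * factor


gopher_flops = training_flops(280e9, 300e9)       # 280B params
chinchilla_flops = training_flops(70e9, 1.4e12)   # 70B params
print(f"Gopher:     {gopher_flops:.2e} FLOPs")
print(f"Chinchilla: {chinchilla_flops:.2e} FLOPs")

# With 4x the compute, equal scaling doubles both parameters and tokens.
bigger_params, bigger_tokens = scale_equally(70e9, 1.4e12, 4.0)
print(f"{bigger_params:.2e} params, {bigger_tokens:.2e} tokens")
```

Under this approximation the two budgets land within roughly 20% of each other, even though Chinchilla is a quarter of Gopher's size.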


P10) Galactica

  • LLM that can store, combine and reason about scientific knowledge
  • Dataset: Large scientific corpus of papers, reference material, knowledge bases …
  • Experiments: Outperform …
    • Chinchilla on mathematical MMLU: 41.3% vs. 35.7%
    • PaLM 540B on MATH: 20.4% vs. 8.8%


P11) CodeGen

  • Family of LLMs up to 16.1B params

  • Dataset:

    • (1) Natural language data
    • (2) Programming language data
  • Open-sourced the training library JAXFORMER
  • Competitive with the previous SOTA on zero-shot Python code generation on HumanEval.

  • Multi-step paradigm for program synthesis

    = Single program is factorized into multiple prompts specifying sub-problems

  • Constructed an open benchmark: Multi-Turn Programming Benchmark (MTPB)

    • Consisting of 115 diverse problem sets that are factorized into multi-turn prompts


P12) AlexaTM (Alexa Teacher Model)

  • Demonstrated that multilingual seq2seq models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks!


P13) Sparrow

  • Information-seeking dialogue agent
    • More helpful, correct, and harmless compared to prompted language model baselines
  • Use RLHF

figure2


P14) Minerva

  • Pretrained on general natural language data

  • Further trained on technical content

    \(\rightarrow\) To tackle quantitative reasoning, which previous LLMs struggle with

    ( e.g., mathematics, science, and engineering problems )


P15) MoD (Mixture-of-Denoisers)

  • Unified perspective for self-supervision in NLP

  • Findings

    • How different pre-training objectives can be cast as one another
    • How interpolating between different objectives can be effective
  • Mixture-of-Denoisers (MoD)

    • Pretraining objective = Combines diverse pre-training paradigms

      \(\rightarrow\) This framework is known as Unifying Language Learning (UL2)

figure2


P16) BLOOM

  • [Model] Decoder-only Transformer
  • [Size] 176B
  • [Dataset] ROOTS corpus
    • Hundreds of sources in 46 natural and 13 programming languages (59 in total)

figure2


P17) GLM

  • GLM-130B: Bilingual (English and Chinese) pre-trained LLM


P18) Pythia

  • Suite of 16 LLMs (70M ~ 12B params)
  • Trained on public data seen in the exact same order
  • Public access to 154 checkpoints for each one of the 16 models


P19) Orca

  • 13B parameter model
  • Learns to imitate the reasoning process of large foundation models
  • Learns from rich signals from GPT-4
    • e.g., explanation traces, step-by-step thought processes, ..


P20) StarCoder

StarCoder & StarCoderBase

  • 15.5B parameter models with 8K context length

  • Infilling capabilities

  • Fast large-batch inference enabled by multi-query attention

  • [Dataset]

    • StarCoderBase: 1T tokens sourced from The Stack

      ( = Large collection of permissively licensed GitHub repositories )

    • StarCoder: StarCoderBase + (fine-tune) 35B Python tokens
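Infilling means the model generates the missing middle of a file given the code before and after the gap. A minimal sketch of building such a prompt follows; the sentinel strings are assumed to match the fill-in-the-middle tokens of the StarCoder tokenizer and should be verified against the model card before use.

```python
# Assumed sentinel tokens; verify against the tokenizer of the model you use.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_infill_prompt(prefix, suffix):
    """Prefix-suffix-middle layout: the model continues after the middle token."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prefix = "def add(a, b):\n    "
suffix = "\n\nprint(add(1, 2))\n"
prompt = build_infill_prompt(prefix, suffix)
print(prompt)
```

The text generated after the final sentinel is the proposed replacement for the gap between `prefix` and `suffix`.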


P21) KOSMOS

  • Multimodal LLM (MLLM): Can perceive general modalities

  • Trained from scratch

    • On web-scale multi-modal corpora

      ( including arbitrarily interleaved text and images, image-caption pairs, and text data )

  • Impressive performance on …

    • (1) Language understanding, generation, and even OCR-free NLP
    • (2) Perception-language tasks
      • e.g., Multimodal dialogue, image captioning, visual question answering
    • (3) Vision tasks
      • e.g., Image recognition with descriptions


P22) Gemini

  • Multimodal LLM (MLLM): Can perceive general modalities

    \(\rightarrow\) Promising capabilities across image, audio, video, and text understanding

  • Built on top of Transformer decoders
  • Supports 32k context length (using efficient attention mechanisms).

  • Three versions
    • (1) Ultra: for highly-complex tasks
    • (2) Pro: for enhanced performance and deployability at scale
    • (3) Nano: for on-device applications


P23) Overview of some of the most representative LLM frameworks

figure2

