Self-Rewarding Language Models
Yuan, Weizhe, et al. "Self-rewarding language models." ICML 2024
References:
- https://aipapersacademy.com/self-rewarding-language-models/
- https://arxiv.org/pdf/2401.10020
Contents
- Motivation
- Background
- RLHF & RLAIF
- DPO
- Self-Rewarding Language Models
- Key Idea
- Datasets for Training Self-rewarding LLMs
- Improving the Model Iteratively
- Experiments
1. Motivation
Pre-trained LLMs
- Improved by getting feedback about the model output from humans
- Then, train on that feedback!
This paper
- In order to reach superhuman agents, future models require superhuman feedback!
Previous works
There are already methods that use LLMs to provide the feedback
- e.g., RLAIF (Reinforcement Learning from AI Feedback)
Then, what is novel about this work?
\(\rightarrow\) The model providing the feedback for the model outputs is actually the SAME model!
2. Background
Previous works
- Reinforcement learning from human feedback (RLHF)
- Reinforcement learning from AI feedback (RLAIF)
(1) RLHF & RLAIF
Procedure
- Step 1) Pre-trained LLM: generate responses
- Step 2) **Responses are evaluated** by humans/AI (RLHF, RLAIF)
\(\rightarrow\) Produce a dataset of “evaluated responses”
- Step 3) Train a reward model based on that data
- A model that learns to rank the LLM responses
- Step 4) LLM is trained using the reward model
- Mostly via PPO (to yield outputs with high ranking)
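To make Steps 3 and 4 concrete, here is a minimal sketch of the pairwise ranking loss commonly used for reward-model training in RLHF-style pipelines (Bradley-Terry style). The `RewardModel` class, tensor shapes, and toy data are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in for the reward model; in practice this is the backbone LLM
    with a scalar head on top."""
    def __init__(self, vocab_size: int = 32000, hidden_size: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.scorer = nn.Linear(hidden_size, 1)  # pooled hidden state -> scalar reward

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        pooled = self.embed(input_ids).mean(dim=1)   # mean-pool over tokens
        return self.scorer(pooled).squeeze(-1)       # one reward per sequence

def pairwise_ranking_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Step 3: -log sigmoid(r_chosen - r_rejected); the preferred response should score higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: random token ids stand in for tokenized (prompt + response) pairs.
model = RewardModel()
chosen_ids = torch.randint(0, 32000, (4, 16))    # responses preferred by the human/AI evaluator
rejected_ids = torch.randint(0, 32000, (4, 16))  # dispreferred responses
loss = pairwise_ranking_loss(model(chosen_ids), model(rejected_ids))
loss.backward()  # Step 4 would then use the trained reward model inside PPO
```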
(2) DPO
Direct Preference Optimization (DPO)
- The need for the reward model is removed
\(\rightarrow\) Train the LLM using the feedback data directly (w/o reward model)
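For reference, the standard DPO objective (Rafailov et al., 2023) optimizes the policy \(\pi_\theta\) directly on preference pairs, with no explicit reward model:

\[
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
\]

where \(y_w\) / \(y_l\) are the preferred / dispreferred responses, \(\sigma\) is the sigmoid, and \(\beta\) controls how far \(\pi_\theta\) may drift from the frozen reference policy \(\pi_{\text{ref}}\).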
3. Self-Rewarding Language Models
(1) Key Idea
A self-rewarding LLM should learn to both ..
- (1) Follow instructions
- (2) Act as a reward model
\(\rightarrow\) That is, the same LLM is used to generate responses and to evaluate them
( It keeps training on these self-evaluated responses iteratively )
(2) Datasets for Training Self-rewarding LLMs
Start with..
- (1) Pre-trained LLM = \(M_0\)
- (2) Two initial small datasets = IFT + EFT
a) Instruction fine-tuning (IFT)
Instruction following dataset
- Crafted by humans
b) Evaluation fine-tuning (EFT)
Samples that contain ..
- (1) Evaluation instruction prompt
- Prompt that asks the model to evaluate the quality of a given response to a particular instruction.
- (2) Evaluation result response
- Determines the score of the response, with reasoning for that decision.
\(\rightarrow\) Serves as training data for the LLM to fill the role of a reward model
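As a rough illustration of this LLM-as-a-Judge setup (not the paper's exact prompt; `JUDGE_TEMPLATE`, `parse_score`, and `model_generate` below are hypothetical names), scoring a response with the same model could look like this:

```python
import re
from typing import Optional

# Paraphrased judge prompt (the paper uses an additive 5-point scoring rubric;
# the wording below is an assumption, not a quote).
JUDGE_TEMPLATE = (
    "Review the instruction and the candidate response, then rate the response "
    "on a 0-5 scale and briefly justify your rating.\n"
    "Instruction: {instruction}\n"
    "Response: {response}\n"
    'End your review with the line "Score: <points>".'
)

def parse_score(judge_output: str) -> Optional[float]:
    """Extract the numeric score from the judge's free-text evaluation."""
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", judge_output)
    return float(match.group(1)) if match else None

def self_reward(model_generate, instruction: str, response: str) -> Optional[float]:
    # The SAME model that produced `response` is prompted to evaluate it.
    prompt = JUDGE_TEMPLATE.format(instruction=instruction, response=response)
    return parse_score(model_generate(prompt))
```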
(3) Improving the Model Iteratively
Self-alignment process that consists of iterations
- Each iteration has two phases (see the sketch after this list)
- (1) Self-instruction creation: the current model \(M_t\) generates new prompts, candidate responses, and self-assigned rewards (acting as its own judge)
- (2) Instruction following training: preference pairs of the highest- vs. lowest-scored responses are used to train the next model \(M_{t+1}\) via DPO
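A minimal sketch of one such iteration, assuming hypothetical helper callables (`generate_prompts`, `sample_responses`, `judge`, `dpo_train`) rather than the authors' actual pipeline:

```python
from typing import Callable, List, Optional, Tuple

def self_alignment_iteration(
    model,
    generate_prompts: Callable[[object], List[str]],          # new instructions drawn from M_t
    sample_responses: Callable[[object, str, int], List[str]],
    judge: Callable[[object, str, str], Optional[float]],     # self-assigned reward for (prompt, response)
    dpo_train: Callable[[object, List[Tuple[str, str, str]]], object],
    n_candidates: int = 4,
):
    preference_pairs: List[Tuple[str, str, str]] = []  # (prompt, chosen, rejected)

    # Phase 1: self-instruction creation -- the current model generates prompts,
    # candidate responses, and scores for its own candidates.
    for prompt in generate_prompts(model):
        candidates = sample_responses(model, prompt, n_candidates)
        scored = [(judge(model, prompt, c), c) for c in candidates]
        scored = [(s, c) for s, c in scored if s is not None]
        if len(scored) < 2:
            continue
        chosen = max(scored, key=lambda sc: sc[0])[1]
        rejected = min(scored, key=lambda sc: sc[0])[1]
        if chosen != rejected:
            preference_pairs.append((prompt, chosen, rejected))

    # Phase 2: instruction-following training -- DPO on the self-generated
    # preference pairs yields the next model M_{t+1}.
    return dpo_train(model, preference_pairs)
```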
4. Experiments
\(\rightarrow\) Later iterations of the model win head-to-head comparisons against earlier iterations, improving both instruction following and reward modeling!