18. BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding (2019)

( read https://seunghan96.github.io/dl/nlp/28.-nlp-BERT-%EC%9D%B4%EB%A1%A0/ first )

Table of Contents

  1. Abstract
  2. Introduction
  3. BERT
    1. Pre-training BERT
    2. Fine-tuning BERT


Abstract

introduce BERT ( Bidirectional Encoder Representations from Transformers )

  • 1) pre-train deep bidirectional representations from unlabeled text

    ( by jointly conditioning on both left & right context in all layers )

  • 2) can be fine-tuned with just one additional output layer


1. Introduction

LM pre-training has proven effective for many NLP tasks!

2 existing strategies for applying pre-trained language representations to downstream tasks :

  • 1) feature-based

    • ex) ELMo uses task-specific architectures,

      that include the pre-trained representations as additional features

  • 2) fine-tuning

    • ex) GPT introduces minimal task-specific parameters &

      is trained on the downstream tasks by simply fine-tuning all pre-trained parameters


The 2 approaches share the same objective function during pre-training, where they use a UNI-directional LM!

This paper improves the fine-tuning based approach by proposing BERT


Contributions

  • 1) importance of bidirectional pre-training for language representations
  • 2) reduce the need for many heavily-engineered task-specific architectures
  • 3) advances the SOTA for 11 NLP tasks


2. BERT

2 steps in this framework :

  • (1) pre-training : trained on unlabeled data, over different pre-training tasks
  • (2) fine-tuning : first initialized with the pre-trained parameters, then all parameters are fine-tuned using labeled data from the downstream task
    • each downstream task has separate fine-tuned models

Distinctive feature of BERT : UNIFIED architecture across different tasks


Model architecture

  • multi-layer bidirectional Transformer
  • notation
    • \(L\) : number of layers ( Transformer blocks )
    • \(H\) : hidden size
    • \(A\) : number of self-attention heads
  • type
    • \(\mathbf{BERT}_{\text{BASE}}\) : \(L=12\), \(H=768\), \(A=12\), Total Parameters \(=110\mathrm{M}\)
    • \(\mathbf{BERT}_{\text{LARGE}}\) : \(L=24\), \(H=1024\), \(A=16\), Total Parameters \(=340\mathrm{M}\)
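
A minimal sketch of the two configurations, using a plain Python dataclass ( my own illustration, not code from the paper ); the parameter counts are the ones reported by the authors, and the feed-forward size of \(4H\) follows the paper's setup.

```python
from dataclasses import dataclass

@dataclass
class BertSize:
    """Hyper-parameters that define a BERT encoder ( names follow the paper's notation )."""
    L: int  # number of layers ( Transformer blocks )
    H: int  # hidden size
    A: int  # number of self-attention heads

# the two published configurations
BERT_BASE  = BertSize(L=12, H=768,  A=12)   # ~110M total parameters
BERT_LARGE = BertSize(L=24, H=1024, A=16)   # ~340M total parameters

# the feed-forward ( intermediate ) size is 4H in the paper
print(BERT_BASE, "FFN size:", 4 * BERT_BASE.H)
```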


BERT vs GPT

  • BERT) uses bi-directional self-attention
  • GPT) uses constrained self-attention, where every token can only attend to the context to its left ( see the sketch below )
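
A toy illustration of that difference ( my own sketch, using PyTorch ) : BERT's attention mask lets every position see the whole sequence, while a GPT-style constrained ( causal ) mask hides everything to the right.

```python
import torch

T = 5  # toy sequence length

# BERT) bidirectional : every token may attend to every other token
bidirectional_mask = torch.ones(T, T)

# GPT) constrained ( causal ) : token i may only attend to positions <= i
causal_mask = torch.tril(torch.ones(T, T))

print(bidirectional_mask)
print(causal_mask)
```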


Input & Output Representations

  • use WordPiece embeddings with a 30,000 token vocabulary
  • ([CLS]) : first token of every sequence ( its final hidden state serves as the aggregate representation for classification tasks )
  • ([SEP]) : separates sentence pairs ( see the sketch below )


( Figure 2 : BERT input representation = token + segment + position embeddings )

2-1. Pre-training BERT

Task 1) Masked LM
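
  • mask 15% of the input WordPiece tokens at random & predict the original tokens
  • of the selected positions : 80% become [MASK], 10% a random token, 10% are kept unchanged

A minimal sketch of this masking rule ( my own illustration; the token list & vocabulary are toy examples ) :

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Return (masked_tokens, labels); labels hold the original token at masked positions."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() > mask_prob:
            continue
        labels[i] = tok                       # the model must predict this original token
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"              # 80% : replace with [MASK]
        elif r < 0.9:
            masked[i] = random.choice(vocab)  # 10% : replace with a random token
        # else : 10% keep the token unchanged
    return masked, labels

vocab = ["my", "dog", "is", "cute", "hairy", "he", "likes", "play", "##ing"]
print(mask_tokens(["[CLS]", "my", "dog", "is", "cute", "[SEP]"], vocab))
```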

Task 2) Next Sentence Prediction (NSP)
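
  • predict whether sentence B actually follows sentence A : 50% of the time B is the real next sentence (IsNext), 50% of the time a random sentence from the corpus (NotNext)

A sketch of how such training pairs can be built ( my own illustration; the corpus is a toy example ) :

```python
import random

def make_nsp_example(corpus, doc_idx, sent_idx):
    """corpus : list of documents, each document a list of sentences."""
    sent_a = corpus[doc_idx][sent_idx]
    if random.random() < 0.5 and sent_idx + 1 < len(corpus[doc_idx]):
        sent_b, label = corpus[doc_idx][sent_idx + 1], "IsNext"    # real next sentence
    else:
        # random sentence ( a real implementation would avoid sampling the true next sentence )
        rand_doc = random.randrange(len(corpus))
        sent_b, label = random.choice(corpus[rand_doc]), "NotNext"
    return sent_a, sent_b, label

corpus = [["the man went to the store", "he bought a gallon of milk"],
          ["penguins are flightless birds", "they live in the antarctic"]]
print(make_nsp_example(corpus, doc_idx=0, sent_idx=0))
```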


2-2. Fine-tuning BERT

  • Self-attention mechanism in the Transformer allows BERT to model many downstream tasks

  • Encoding a concatenated sentence pair with self-attention effectively includes bidirectional cross attention between the two sentences!

  • For each task, simply plug in the task-specific inputs & outputs into BERT & fine-tune all the parameters end-to-end! ( see the sketch below )
  • (compared to pre-training) fine-tuning is relatively inexpensive
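
A minimal PyTorch sketch of this recipe for a sentence(-pair) classification task, assuming the Hugging Face `transformers` library for the pre-trained encoder ( my assumption, not the paper's code ) : the only new parameters are one output layer on top of the final hidden state of [CLS], and the whole model is updated end-to-end.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer  # assumed dependency

class BertClassifier(nn.Module):
    def __init__(self, num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # the only new parameters : one output layer over the [CLS] representation
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, **inputs):
        hidden = self.bert(**inputs).last_hidden_state  # (batch, seq_len, H)
        cls = hidden[:, 0]                               # final hidden state of [CLS]
        return self.classifier(cls)                      # (batch, num_labels)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertClassifier(num_labels=2)
batch = tokenizer(["a great movie", "a terrible movie"], padding=True, return_tensors="pt")

logits = model(**batch)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1, 0]))
loss.backward()  # gradients flow through ALL parameters ( BERT + the new output layer )
```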