16. Deep contextualized word representations (2018)
See https://seunghan96.github.io/dl/nlp/14.-nlp-ELMO/ for reference.
Table of Contents
- Abstract
- Introduction
- ELMo : Embeddings from Language Models
- BiLM
- ELMo
- Using biLMs for supervised NLP tasks
Abstract
Introduces DEEP CONTEXTUALIZED word representations that model both
- 1) complex characteristics of word use
- 2) how these uses vary across linguistic contexts
Learned using a biLM (bidirectional language model), which is pre-trained on a large text corpus
1. Introduction
Proposed method: ELMo (Embeddings from Language Models)
- ELMo representations are a "function of the ENTIRE INPUT SENTENCE"
- DEEP: they are a function of ALL of the internal layers of the biLM
  (combining internal states allows for very rich word representations)
2. ELMo: Embeddings from Language Models
- Functions of the entire input sentence
- Computed on top of a 2-layer biLM with character convolutions (a minimal sketch of the character-CNN token embedder follows below)
- Allows semi-supervised learning (the biLM is pretrained at a large scale, then reused for the supervised task)
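The "character convolutions" above produce the context-independent token representation \(\mathbf{x}_{k}\) that feeds the biLM. A minimal PyTorch sketch of that idea, with illustrative hyperparameters (the paper's exact filter configuration and highway layers are omitted):

```python
import torch
import torch.nn as nn

class CharCNNTokenEmbedder(nn.Module):
    """Character convolutions -> one context-independent vector x_k per token."""
    def __init__(self, n_chars=262, char_dim=16,
                 filters=((1, 32), (2, 32), (3, 64)), out_dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, n_filters, kernel_size=width)
             for width, n_filters in filters]
        )
        self.proj = nn.Linear(sum(n for _, n in filters), out_dim)

    def forward(self, char_ids):                        # (batch, seq_len, max_chars)
        b, t, c = char_ids.shape
        e = self.char_emb(char_ids.reshape(b * t, c))   # (b*t, max_chars, char_dim)
        e = e.transpose(1, 2)                           # (b*t, char_dim, max_chars)
        pooled = [torch.relu(conv(e)).max(dim=-1).values for conv in self.convs]
        x = torch.cat(pooled, dim=-1)                   # (b*t, total_filters)
        return self.proj(x).reshape(b, t, -1)           # (batch, seq_len, out_dim)
```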
2-1. BiLM
(1) Forward LM
- \(p\left(t_{1}, t_{2}, \ldots, t_{N}\right)=\prod_{k=1}^{N} p\left(t_{k} \mid t_{1}, t_{2}, \ldots, t_{k-1}\right)\)
- At each position \(k\), each LSTM layer outputs a context-dependent representation \(\overrightarrow{\mathbf{h}}_{k, j}^{LM}\) (where \(j=1, \ldots, L\))
- The top-layer LSTM output \(\overrightarrow{\mathbf{h}}_{k, L}^{LM}\) is used to predict the next token \(t_{k+1}\) with a softmax layer (see the sketch below)
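A minimal PyTorch sketch of the forward direction, assuming token vectors \(\mathbf{x}_{k}\) come from a character-CNN embedder like the one sketched earlier; the class name and hidden sizes are illustrative. Note that `nn.LSTM` only exposes the top layer's per-step outputs, so a full ELMo implementation keeps the layers separate in order to record every \(\overrightarrow{\mathbf{h}}_{k, j}^{LM}\).

```python
import torch.nn as nn

class ForwardLM(nn.Module):
    """2-layer forward LSTM LM: the top-layer output predicts the next token."""
    def __init__(self, vocab_size, input_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.softmax_head = nn.Linear(hidden_dim, vocab_size)  # Theta_s, shared with the backward LM

    def forward(self, x):                 # x: (batch, N, input_dim) token vectors x_k
        h_top, _ = self.lstm(x)           # top-layer states h_{k,L}: (batch, N, hidden_dim)
        return self.softmax_head(h_top)   # logits for the next token t_{k+1}
```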
(2) Backward LM
- \(p\left(t_{1}, t_{2}, \ldots, t_{N}\right)=\prod_{k=1}^{N} p\left(t_{k} \mid t_{k+1}, t_{k+2}, \ldots, t_{N}\right)\).
The biLM combines both (1) & (2)
\(\rightarrow\) jointly maximize the log likelihood of the forward and backward directions:
\(\sum_{k=1}^{N}\left(\log p\left(t_{k} \mid t_{1}, \ldots, t_{k-1} ; \Theta_{x}, \vec{\Theta}_{LSTM}, \Theta_{s}\right)+\log p\left(t_{k} \mid t_{k+1}, \ldots, t_{N} ; \Theta_{x}, \overleftarrow{\Theta}_{LSTM}, \Theta_{s}\right)\right)\)
- The token representation \(\Theta_{x}\) and the softmax layer \(\Theta_{s}\) are shared across the two directions, while the LSTM parameters are kept separate
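A minimal sketch of this joint objective, assuming forward and backward LMs shaped like the `ForwardLM` sketch above; the backward LM is simply fed the reversed sequence, so predicting its "next" token corresponds to \(p\left(t_{k} \mid t_{k+1}, \ldots, t_{N}\right)\). Sharing \(\Theta_{x}\) and \(\Theta_{s}\) amounts to giving both models the same token embedder and softmax head. The function name and tensor layout are illustrative.

```python
import torch
import torch.nn.functional as F

def bilm_loss(fwd_lm, bwd_lm, x, token_ids):
    """Negative of the joint log likelihood above (so it can be minimized)."""
    # Forward LM at position k predicts t_{k+1}.
    fwd_logits = fwd_lm(x[:, :-1])
    fwd_nll = F.cross_entropy(fwd_logits.reshape(-1, fwd_logits.size(-1)),
                              token_ids[:, 1:].reshape(-1))
    # Backward LM reads the reversed sequence; its "next" token is the previous
    # token in the original order, i.e. p(t_k | t_{k+1}, ..., t_N).
    x_rev, ids_rev = torch.flip(x, dims=[1]), torch.flip(token_ids, dims=[1])
    bwd_logits = bwd_lm(x_rev[:, :-1])
    bwd_nll = F.cross_entropy(bwd_logits.reshape(-1, bwd_logits.size(-1)),
                              ids_rev[:, 1:].reshape(-1))
    return fwd_nll + bwd_nll
```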
2-2. ELMo
- Task-specific combination of the intermediate layer representations of the biLM
- For each token \(t_{k}\), an \(L\)-layer biLM computes \(2L+1\) representations:
  \(\begin{aligned} R_{k} &=\left\{\mathbf{x}_{k}^{LM}, \overrightarrow{\mathbf{h}}_{k, j}^{LM}, \overleftarrow{\mathbf{h}}_{k, j}^{LM} \mid j=1, \ldots, L\right\} \\ &=\left\{\mathbf{h}_{k, j}^{LM} \mid j=0, \ldots, L\right\} \end{aligned}\)
  - \(\mathbf{h}_{k, 0}^{LM}\) : the token layer (context-independent representation \(\mathbf{x}_{k}^{LM}\))
  - \(\mathbf{h}_{k, j}^{LM}=\left[\overrightarrow{\mathbf{h}}_{k, j}^{LM} ; \overleftarrow{\mathbf{h}}_{k, j}^{LM}\right]\) for each biLSTM layer \(j \geq 1\)
Compute a task-specific weighting of all biLM layers (see the sketch below):
\(\mathrm{ELMo}_{k}^{\text{task}}=E\left(R_{k} ; \Theta^{\text{task}}\right)=\gamma^{\text{task}} \sum_{j=0}^{L} s_{j}^{\text{task}} \mathbf{h}_{k, j}^{LM}\)
- \(\mathbf{s}^{\text{task}}\) : softmax-normalized weights over the layers
- \(\gamma^{\text{task}}\) : scalar parameter that scales the entire ELMo vector
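A minimal sketch of this weighting, assuming the \(L+1\) layer representations have already been stacked into one tensor (with the token layer padded or duplicated so all layers share one dimension, as common implementations do). The class name is illustrative.

```python
import torch
import torch.nn as nn

class ELMoScalarMix(nn.Module):
    """ELMo_k^task = gamma^task * sum_j s_j^task * h_{k,j}^LM."""
    def __init__(self, num_layers):                       # num_layers = L + 1
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_layers))    # softmax(w) gives s^task
        self.gamma = nn.Parameter(torch.ones(1))          # gamma^task

    def forward(self, layer_reps):                        # (L+1, batch, seq_len, dim)
        s = torch.softmax(self.w, dim=0)
        mixed = (s.view(-1, 1, 1, 1) * layer_reps).sum(dim=0)
        return self.gamma * mixed                         # (batch, seq_len, dim)
```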
2-3. Using biLMs for supervised NLP tasks
Given a pre-trained biLM and a supervised architecture for the target NLP task, it is a simple process to use the biLM to improve the task model:
- Simply run the biLM and record all of the layer representations for each word
- Let the end-task model learn a linear combination of these representations, then concatenate \(\mathrm{ELMo}_{k}^{\text{task}}\) with the token representation \(\mathbf{x}_{k}\) and pass it into the task RNN (a minimal sketch follows below)
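A minimal sketch of this recipe: freeze the biLM, record its layer representations, mix them with the task-specific weights from the previous section (reusing the `ELMoScalarMix` sketch above), and concatenate \(\mathrm{ELMo}_{k}^{\text{task}}\) with the task model's own token vector \(\mathbf{x}_{k}\) before its sequence encoder. The module names and the tagging-style head are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class TaskModelWithELMo(nn.Module):
    """Supervised task model that consumes frozen biLM layer representations."""
    def __init__(self, token_dim, elmo_dim, hidden_dim, num_tags, num_bilm_layers=3):
        super().__init__()
        self.elmo_mix = ELMoScalarMix(num_bilm_layers)    # from the sketch above; L=2 biLM -> 3 layers
        self.encoder = nn.LSTM(token_dim + elmo_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_tags)

    def forward(self, x_tokens, bilm_layer_reps):
        # bilm_layer_reps: (L+1, batch, seq_len, elmo_dim), recorded from the frozen biLM
        elmo = self.elmo_mix(bilm_layer_reps)             # ELMo_k^task
        enhanced = torch.cat([x_tokens, elmo], dim=-1)    # [x_k ; ELMo_k^task]
        out, _ = self.encoder(enhanced)
        return self.classifier(out)                       # per-token logits for the end task
```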