16. Deep contextualized word representations (2018)
See https://seunghan96.github.io/dl/nlp/14.-nlp-ELMO/ for reference.
Table of Contents
- Abstract
- Introduction
- ELMo : Embeddings from Language Models
- BiLM
- ELMo
- Using biLMs for supervised NLP tasks
Abstract
Introduces DEEP CONTEXTUALIZED word representations that model both
- 1) complex characteristics of word use
- 2) how these uses vary across linguistic contexts
Learned using a biLM (bidirectional language model), which is pre-trained on a large text corpus
1. Introduction
Proposed method: ELMo (Embeddings from Language Models)
- ELMo representations are a "function of the ENTIRE INPUT SENTENCE"
- DEEP: they are a function of ALL of the internal layers of the biLM
  (combining internal states allows for very rich word representations)
2. ELMo: Embeddings from Language Models
- Functions of the entire input sentence
- Computed on top of a 2-layer biLM with character convolutions (a minimal sketch of the character-CNN token embedder follows below)
- Allows semi-supervised learning (the biLM is pretrained at a large scale, then reused for the supervised task)
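The "character convolutions" above produce the context-independent token representation \(\mathbf{x}_{k}\) that feeds the biLM. A minimal PyTorch sketch of that idea, with illustrative hyperparameters (the paper's exact filter configuration and highway layers are omitted):

```python
import torch
import torch.nn as nn

class CharCNNTokenEmbedder(nn.Module):
    """Character convolutions -> one context-independent vector x_k per token."""
    def __init__(self, n_chars=262, char_dim=16,
                 filters=((1, 32), (2, 32), (3, 64)), out_dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, n_filters, kernel_size=width)
             for width, n_filters in filters]
        )
        self.proj = nn.Linear(sum(n for _, n in filters), out_dim)

    def forward(self, char_ids):                        # (batch, seq_len, max_chars)
        b, t, c = char_ids.shape
        e = self.char_emb(char_ids.reshape(b * t, c))   # (b*t, max_chars, char_dim)
        e = e.transpose(1, 2)                           # (b*t, char_dim, max_chars)
        pooled = [torch.relu(conv(e)).max(dim=-1).values for conv in self.convs]
        x = torch.cat(pooled, dim=-1)                   # (b*t, total_filters)
        return self.proj(x).reshape(b, t, -1)           # (batch, seq_len, out_dim)
```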
2-1. BiLM
(1) Forward LM
- \(p\left(t_{1}, t_{2}, \ldots, t_{N}\right)=\prod_{k=1}^{N} p\left(t_{k} \mid t_{1}, t_{2}, \ldots, t_{k-1}\right)\)
- At each position \(k\), each LSTM layer outputs a context-dependent representation \(\overrightarrow{\mathbf{h}}_{k, j}^{LM}\) (where \(j=1, \ldots, L\))
- The top-layer LSTM output \(\overrightarrow{\mathbf{h}}_{k, L}^{LM}\) is used to predict the next token \(t_{k+1}\) with a softmax layer (see the sketch below)
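A minimal PyTorch sketch of the forward direction, assuming token vectors \(\mathbf{x}_{k}\) come from a character-CNN embedder like the one sketched earlier; the class name and hidden sizes are illustrative. Note that `nn.LSTM` only exposes the top layer's per-step outputs, so a full ELMo implementation keeps the layers separate in order to record every \(\overrightarrow{\mathbf{h}}_{k, j}^{LM}\).

```python
import torch.nn as nn

class ForwardLM(nn.Module):
    """2-layer forward LSTM LM: the top-layer output predicts the next token."""
    def __init__(self, vocab_size, input_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.softmax_head = nn.Linear(hidden_dim, vocab_size)  # Theta_s, shared with the backward LM

    def forward(self, x):                 # x: (batch, N, input_dim) token vectors x_k
        h_top, _ = self.lstm(x)           # top-layer states h_{k,L}: (batch, N, hidden_dim)
        return self.softmax_head(h_top)   # logits for the next token t_{k+1}
```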
(2) Backward LM
- \(p\left(t_{1}, t_{2}, \ldots, t_{N}\right)=\prod_{k=1}^{N} p\left(t_{k} \mid t_{k+1}, t_{k+2}, \ldots, t_{N}\right)\).
The biLM combines both (1) & (2)
\(\rightarrow\) jointly maximize the log likelihood of the forward and backward directions:
\(\sum_{k=1}^{N}\left(\log p\left(t_{k} \mid t_{1}, \ldots, t_{k-1} ; \Theta_{x}, \vec{\Theta}_{LSTM}, \Theta_{s}\right)+\log p\left(t_{k} \mid t_{k+1}, \ldots, t_{N} ; \Theta_{x}, \overleftarrow{\Theta}_{LSTM}, \Theta_{s}\right)\right)\)
- The token representation \(\Theta_{x}\) and the softmax layer \(\Theta_{s}\) are shared across the two directions, while the LSTM parameters are kept separate
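A minimal sketch of this joint objective, assuming forward and backward LMs shaped like the `ForwardLM` sketch above; the backward LM is simply fed the reversed sequence, so predicting its "next" token corresponds to \(p\left(t_{k} \mid t_{k+1}, \ldots, t_{N}\right)\). Sharing \(\Theta_{x}\) and \(\Theta_{s}\) amounts to giving both models the same token embedder and softmax head. The function name and tensor layout are illustrative.

```python
import torch
import torch.nn.functional as F

def bilm_loss(fwd_lm, bwd_lm, x, token_ids):
    """Negative of the joint log likelihood above (so it can be minimized)."""
    # Forward LM at position k predicts t_{k+1}.
    fwd_logits = fwd_lm(x[:, :-1])
    fwd_nll = F.cross_entropy(fwd_logits.reshape(-1, fwd_logits.size(-1)),
                              token_ids[:, 1:].reshape(-1))
    # Backward LM reads the reversed sequence; its "next" token is the previous
    # token in the original order, i.e. p(t_k | t_{k+1}, ..., t_N).
    x_rev, ids_rev = torch.flip(x, dims=[1]), torch.flip(token_ids, dims=[1])
    bwd_logits = bwd_lm(x_rev[:, :-1])
    bwd_nll = F.cross_entropy(bwd_logits.reshape(-1, bwd_logits.size(-1)),
                              ids_rev[:, 1:].reshape(-1))
    return fwd_nll + bwd_nll
```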
2-2. ELMo
- Task-specific combination of the intermediate layer representations of the biLM
- For each token \(t_{k}\), an \(L\)-layer biLM computes \(2L+1\) representations:
  \(\begin{aligned} R_{k} &=\left\{\mathbf{x}_{k}^{LM}, \overrightarrow{\mathbf{h}}_{k, j}^{LM}, \overleftarrow{\mathbf{h}}_{k, j}^{LM} \mid j=1, \ldots, L\right\} \\ &=\left\{\mathbf{h}_{k, j}^{LM} \mid j=0, \ldots, L\right\} \end{aligned}\)
  - \(\mathbf{h}_{k, 0}^{LM}\) : the token layer (context-independent representation \(\mathbf{x}_{k}^{LM}\))
  - \(\mathbf{h}_{k, j}^{LM}=\left[\overrightarrow{\mathbf{h}}_{k, j}^{LM} ; \overleftarrow{\mathbf{h}}_{k, j}^{LM}\right]\) for each biLSTM layer \(j \geq 1\)
Compute a task-specific weighting of all biLM layers (see the sketch below):
\(\mathrm{ELMo}_{k}^{\text{task}}=E\left(R_{k} ; \Theta^{\text{task}}\right)=\gamma^{\text{task}} \sum_{j=0}^{L} s_{j}^{\text{task}} \mathbf{h}_{k, j}^{LM}\)
- \(\mathbf{s}^{\text{task}}\) : softmax-normalized weights over the layers
- \(\gamma^{\text{task}}\) : scalar parameter that scales the entire ELMo vector
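A minimal sketch of this weighting, assuming the \(L+1\) layer representations have already been stacked into one tensor (with the token layer padded or duplicated so all layers share one dimension, as common implementations do). The class name is illustrative.

```python
import torch
import torch.nn as nn

class ELMoScalarMix(nn.Module):
    """ELMo_k^task = gamma^task * sum_j s_j^task * h_{k,j}^LM."""
    def __init__(self, num_layers):                       # num_layers = L + 1
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_layers))    # softmax(w) gives s^task
        self.gamma = nn.Parameter(torch.ones(1))          # gamma^task

    def forward(self, layer_reps):                        # (L+1, batch, seq_len, dim)
        s = torch.softmax(self.w, dim=0)
        mixed = (s.view(-1, 1, 1, 1) * layer_reps).sum(dim=0)
        return self.gamma * mixed                         # (batch, seq_len, dim)
```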
2-3. Using biLMs for supervised NLP tasks
Given a pre-trained biLM and a supervised architecture for the target NLP task, it is a simple process to use the biLM to improve the task model:
- Simply run the biLM and record all of the layer representations for each word
- Let the end-task model learn a linear combination of these representations, then concatenate \(\mathrm{ELMo}_{k}^{\text{task}}\) with the token representation \(\mathbf{x}_{k}\) and pass it into the task RNN (a minimal sketch follows below)
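A minimal sketch of this recipe: freeze the biLM, record its layer representations, mix them with the task-specific weights from the previous section (reusing the `ELMoScalarMix` sketch above), and concatenate \(\mathrm{ELMo}_{k}^{\text{task}}\) with the task model's own token vector \(\mathbf{x}_{k}\) before its sequence encoder. The module names and the tagging-style head are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class TaskModelWithELMo(nn.Module):
    """Supervised task model that consumes frozen biLM layer representations."""
    def __init__(self, token_dim, elmo_dim, hidden_dim, num_tags, num_bilm_layers=3):
        super().__init__()
        self.elmo_mix = ELMoScalarMix(num_bilm_layers)    # from the sketch above; L=2 biLM -> 3 layers
        self.encoder = nn.LSTM(token_dim + elmo_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_tags)

    def forward(self, x_tokens, bilm_layer_reps):
        # bilm_layer_reps: (L+1, batch, seq_len, elmo_dim), recorded from the frozen biLM
        elmo = self.elmo_mix(bilm_layer_reps)             # ELMo_k^task
        enhanced = torch.cat([x_tokens, elmo], dim=-1)    # [x_k ; ELMo_k^task]
        out, _ = self.encoder(enhanced)
        return self.classifier(out)                       # per-token logits for the end task
```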