7. Bi-Directional Attention Flow for Machine Comprehension (2017)
Table of Contents
- Abstract
- Introduction
- Model
- Character Embedding Layer
- Word Embedding Layer
- Contextual Embedding Layer
- Attention Flow Layer
- Modeling Layer
- Output Layer
Abstract
Machine Comprehension (MC)
- answering a query about a given context paragraph
- requires modeling the interaction between (1) the context & (2) the query
introduce the BiDAF ( Bi-Directional Attention Flow ) network
- a “multi-stage hierarchical” process
( that represents the context at “different levels of granularity” )
- uses a bi-directional attention flow mechanism to obtain a “query-aware context” representation
without early summarization
1. Introduction
tasks of MC ( Machine Comprehension ) & QA ( Question Answering )
- key factor of advancement : the NEURAL ATTENTION MECHANISM
\(\rightarrow\) focuses on a targeted area within a context paragraph
Attention mechanisms ( in previous works ) typically :
- 1) extract the most relevant information from the context for answering the question
- 2) are temporally dynamic
( = attention weights at the “current” time step are a function of those at the “previous” time step )
Introduce the BiDAF ( Bi-Directional Attention Flow ) network
- hierarchical : (1) character-level & (2) word-level & (3) contextual embeddings & (4) bi-directional attention flow
Attention layer
Property 1)
- NOT used to summarize the context paragraph ( into a fixed-size vector )
- INSTEAD, attention is computed at every time step & flows through to the subsequent layer
\(\rightarrow\) reduces information loss
Property 2)
- MEMORY-LESS attention mechanism
( = at each time step, attention is a function of only the query & the context at the current time step )
( & does not directly depend on the attention at the previous time step )
2. Model
1) Character Embedding Layer
2) Word Embedding Layer
3) Contextual Embedding Layer
4) Attention Flow Layer
5) Modeling Layer
6) Output Layer
1) Character Embedding Layer
- maps each word into a high-dim vector space
- words in the input context paragraph : \(\left\{x_{1}, \ldots x_{T}\right\}\)
words in the query : \(\left\{q_{1}, \ldots q_{J}\right\}\)
- uses a CharCNN ( character-level CNN, Kim 2014 ); a sketch follows below
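A minimal PyTorch sketch of a character-level CNN word embedding in the spirit of Kim (2014); the class name and hyperparameters ( char_dim, num_filters, kernel_size ) are illustrative assumptions, not the paper's exact settings:

```python
import torch
import torch.nn as nn

class CharCNNEmbedding(nn.Module):
    """Embed characters, convolve over each word, max-pool over time."""
    def __init__(self, num_chars, char_dim=16, num_filters=100, kernel_size=5):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim, padding_idx=0)
        # 1D convolution along the character axis of each word
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, char_ids):
        # char_ids: (batch, num_words, word_len) integer tensor
        B, T, W = char_ids.shape
        x = self.char_emb(char_ids.view(B * T, W))   # (B*T, W, char_dim)
        x = self.conv(x.transpose(1, 2))             # (B*T, num_filters, W)
        x = torch.relu(x).max(dim=2).values          # max-over-time pooling
        return x.view(B, T, -1)                      # (B, T, num_filters)
```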
2) Word Embedding Layer
- also maps each word into a high-dim vector space
- uses pre-trained GloVe word vectors
Concatenation of 1) & 2) \(\rightarrow\) passed to a 2-layer Highway Network ( sketched below, after the output shapes )
Output : 2 sequences of \(d\)-dim vectors
- 1) \(\mathbf{X} \in \mathbb{R}^{d \times T}\) : for the context
- 2) \(\mathbf{Q} \in \mathbb{R}^{d \times J}\) : for the query
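A minimal sketch of the 2-layer highway network ( Srivastava et al., 2015 ), assuming the concatenated char + word embedding has already been projected to dimension \(d\); gate bias initialization details are omitted:

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Each layer mixes a nonlinear transform of x with x itself via a learned gate."""
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.transforms = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.gates = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, x):
        for transform, gate in zip(self.transforms, self.gates):
            g = torch.sigmoid(gate(x))       # carry/transform gate in [0, 1]
            h = torch.relu(transform(x))     # candidate transform
            x = g * h + (1 - g) * x          # gated residual mix
        return x
```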
3) Contextual Embedding Layer
- uses a biLSTM on top of the embeddings from the previous layer
( to model temporal interactions between words )
- obtains two matrices ( \(2 d\) rows, since the forward & backward LSTM outputs are concatenated ) :
- 1) \(\mathbf{H} \in \mathbb{R}^{2 d \times T}\) : from context word vectors \(\mathbf{X}\)
- 2) \(\mathbf{U} \in \mathbb{R}^{2 d \times J}\) : from query word vectors \(\mathbf{Q}\)
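A minimal sketch of the contextual encoder; whether the context and query share LSTM weights is left as an implementation choice here ( the same module is simply applied to both \(\mathbf{X}\) and \(\mathbf{Q}\) ):

```python
import torch.nn as nn

class ContextualEmbedding(nn.Module):
    """biLSTM over d-dim embeddings; concatenated directions give 2d per step."""
    def __init__(self, d):
        super().__init__()
        self.lstm = nn.LSTM(d, d, batch_first=True, bidirectional=True)

    def forward(self, x):
        # x: (batch, seq_len, d) -> (batch, seq_len, 2d)
        out, _ = self.lstm(x)
        return out

# H = encoder(X)  # (batch, T, 2d);  U = encoder(Q)  # (batch, J, 2d)
```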
4) Attention Flow Layer
- links & fuses information from the “context” & “query” words
- NOT used to summarize the query and context into single feature vectors
instead, the attention ( together with the embeddings from previous layers ) flows through to the subsequent modeling layer
- input : context \(\mathbf{H}\) and query \(\mathbf{U}\)
- output : query-aware representations of the context words, \(\mathbf{G} \in \mathbb{R}^{8 d \times T}\)
- computes the attention in 2 directions
- 1) context-to-query ( C2Q )
- 2) query-to-context ( Q2C )
- both are derived from a shared similarity matrix : \(\mathbf{S}_{t j}=\alpha\left(\mathbf{H}_{: t}, \mathbf{U}_{: j}\right) \in \mathbb{R}\)
- \(\alpha\) : trainable scalar function that encodes the similarity between its two input vectors
\(\rightarrow\) \(\alpha(\mathbf{h}, \mathbf{u})=\mathbf{w}_{(\mathbf{S})}^{\top}[\mathbf{h} ; \mathbf{u} ; \mathbf{h} \circ \mathbf{u}]\), where \(\circ\) is element-wise multiplication & \([;]\) is vector concatenation
- C2Q : \(\tilde{\mathbf{U}}_{: t}=\sum_{j} a_{t j} \mathbf{U}_{: j}\), with \(\mathbf{a}_{t}=\operatorname{softmax}\left(\mathbf{S}_{t:}\right) \in \mathbb{R}^{J}\)
- Q2C : \(\tilde{\mathbf{h}}=\sum_{t} b_{t} \mathbf{H}_{: t}\), with \(\mathbf{b}=\operatorname{softmax}\left(\max _{c o l}(\mathbf{S})\right) \in \mathbb{R}^{T}\), tiled \(T\) times to \(\tilde{\mathbf{H}} \in \mathbb{R}^{2 d \times T}\)
- combined : \(\mathbf{G}_{: t}=\left[\mathbf{H}_{: t} ; \tilde{\mathbf{U}}_{: t} ; \mathbf{H}_{: t} \circ \tilde{\mathbf{U}}_{: t} ; \mathbf{H}_{: t} \circ \tilde{\mathbf{H}}_{: t}\right] \in \mathbb{R}^{8 d}\)
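A minimal PyTorch sketch of the attention flow computation under the definitions above ( single example, no batch dimension; padding masks omitted for brevity ):

```python
import torch

def attention_flow(H, U, w_s):
    """H: (2d, T) context, U: (2d, J) query, w_s: (6d,) trainable weights.
    Returns G: (8d, T), the query-aware context representation."""
    two_d, T = H.shape
    J = U.shape[1]
    # similarity matrix S[t, j] = w_s^T [h; u; h o u]
    Ht = H.t().unsqueeze(1).expand(T, J, two_d)       # (T, J, 2d)
    Uj = U.t().unsqueeze(0).expand(T, J, two_d)       # (T, J, 2d)
    S = torch.cat([Ht, Uj, Ht * Uj], dim=2) @ w_s     # (T, J)
    # C2Q: attended query vector for each context word
    a = torch.softmax(S, dim=1)                       # (T, J)
    U_tilde = U @ a.t()                               # (2d, T)
    # Q2C: one attended context vector, tiled over all T steps
    b = torch.softmax(S.max(dim=1).values, dim=0)     # (T,)
    h_tilde = (H @ b).unsqueeze(1).expand(two_d, T)   # (2d, T)
    # G combines the context with the attended vectors
    return torch.cat([H, U_tilde, H * U_tilde, H * h_tilde], dim=0)
```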
5) Modeling Layer
- input : \(\mathbf{G}\), the query-aware representations of the context words ( output of the attention flow layer )
- uses a 2-layer bi-LSTM
- captures the interaction among context words, conditioned on the query
- ( different from the contextual embedding layer \(\rightarrow\) that layer captures interaction among context words independent of the query )
- obtains \(\mathbf{M} \in \mathbb{R}^{2 d \times T}\) ( a sketch follows below )
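As a sketch, the modeling layer reduces to a stacked biLSTM over \(\mathbf{G}\) ( `d` below is a hypothetical hidden size, not necessarily the paper's setting ):

```python
import torch.nn as nn

d = 100  # hypothetical hidden size
# 2-layer biLSTM: G has 8d features per step, M has 2d features per step
modeling_lstm = nn.LSTM(input_size=8 * d, hidden_size=d, num_layers=2,
                        batch_first=True, bidirectional=True)
# M, _ = modeling_lstm(G)   # G: (batch, T, 8d) -> M: (batch, T, 2d)
```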
6) Output Layer
- application-specific ( in this paper : used for the QA task )
- QA task
- requires the model to find a “sub-phrase” of the paragraph that answers the question
- the phrase is derived by predicting the “start” & “end” indices ( sketched below )
- Start index : \(\mathbf{p}^{1}=\operatorname{softmax}\left(\mathbf{w}_{\left(\mathbf{p}^{1}\right)}^{\top}[\mathbf{G} ; \mathbf{M}]\right)\)
- End index : \(\mathbf{p}^{2}=\operatorname{softmax}\left(\mathbf{w}_{\left(\mathbf{p}^{2}\right)}^{\top}\left[\mathbf{G} ; \mathbf{M}^{2}\right]\right)\)
( where \(\mathbf{M}^{2}\) is obtained by passing \(\mathbf{M}\) through another biLSTM layer )
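A minimal sketch of the span prediction above ( single example; `end_lstm` is an assumed biLSTM module that turns \(\mathbf{M}\) into \(\mathbf{M}^{2}\) ):

```python
import torch

def predict_span(G, M, w_p1, w_p2, end_lstm):
    """G: (8d, T), M: (2d, T); w_p1, w_p2: (10d,) trainable weights.
    Returns start/end probability distributions over the T context positions."""
    p1 = torch.softmax(w_p1 @ torch.cat([G, M], dim=0), dim=0)    # (T,)
    M2, _ = end_lstm(M.t().unsqueeze(0))                          # (1, T, 2d)
    M2 = M2.squeeze(0).t()                                        # (2d, T)
    p2 = torch.softmax(w_p2 @ torch.cat([G, M2], dim=0), dim=0)   # (T,)
    return p1, p2
```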
Loss Function : NLL
- negative log-likelihood of the true start & end indices, averaged over all \(N\) examples :
- \(L(\theta)=-\frac{1}{N} \sum_{i}^{N}\left[\log \left(\mathbf{p}_{y_{i}^{1}}^{1}\right)+\log \left(\mathbf{p}_{y_{i}^{2}}^{2}\right)\right]\)
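The loss in code form, a minimal sketch assuming batched start/end distributions `p1`, `p2` of shape (N, T) and gold indices `y1`, `y2` of shape (N,):

```python
import torch

def bidaf_loss(p1, p2, y1, y2):
    """Mean NLL of the true start (y1) and end (y2) indices."""
    rows = torch.arange(p1.shape[0])
    # pick each example's probability at its gold index, then average over N
    return -(torch.log(p1[rows, y1]) + torch.log(p2[rows, y2])).mean()
```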