7. Bi-Directional Attention Flow for Machine Comprehension (2017)
- Abstract
- Introduction
- Model
Machine Comprehension (MC)
- answering a question (query) about a given context paragraph
- models interaction between (1) context & (2) query
introduce BiDAF ( Bi-Directional Attention Flow ) network
“multi-stage hierarchical” process
( that represents the context at “different levels of granularity” )
uses bi-directional attention flow mechanism to obtain a “query-aware context” representation
without early summarization
1. Introduction
task of MC(Machine Comprehension) & QA(Question Answering)
key factors of advancement : NEURAL ATTENTION MECHANISM
\(\rightarrow\) focus on a targeted area within a context paragraph
Attention mechanism ( in previous works )
1) extracts the most relevant information from the context, for answering the question
2) temporally dynamic!
( = attention weights at the “current time” step are a function of the attended vector at the “previous time” step )
Introduce BiDAF ( Bi-Directional Attention Flow ) network
hierarchical : (1) character-level & (2) word-level & (3) contextual embeddings & (4) biDAF
attention layer
Property 1)
NOT used to summarize the context paragraph ( into a fixed-size vector )
INSTEAD, attention is computed for every time step & flows through to the subsequent layer
\(\rightarrow\) reduces information loss
Property 2)
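MEMORY-LESS attention mechanism
( = attention is still computed iteratively, at every time step )
( & at each step it depends only on the query and the context at the current time step, not directly on the attention at the previous time step )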
2. Model
1) Character Embedding Layer
2) Word Embedding Layer
3) Contextual Embedding Layer
4) Attention Flow Layer
5) Modeling Layer
6) Output Layer
1) Character Embedding Layer
map each word into high-dim vector space
words in input context paragraph : \(\left\{x_{1}, \ldots, x_{T}\right\}\).
words in query : \(\left\{q_{1}, \ldots, q_{J}\right\}\).
use CharCNN
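A minimal NumPy sketch of the CharCNN idea: embed the characters of one word, slide a 1D convolution over them, and max-pool over all positions to get a fixed-size vector per word. The names `char_cnn`, `char_emb`, `filters` and all sizes are hypothetical, chosen only for illustration, and the sketch assumes the word has at least as many characters as the filter width.

```python
import numpy as np

def char_cnn(char_ids, char_emb, filters):
    """Character-level embedding of a single word.

    char_ids: (L,) int array, character indices of the word (assumes L >= k)
    char_emb: (V, c) character embedding table
    filters:  (k, c, d_char) convolution filters of width k
    returns:  (d_char,) fixed-size vector for the word
    """
    E = char_emb[char_ids]                                   # (L, c)
    k = filters.shape[0]
    # Every window of k consecutive character embeddings.
    windows = np.stack([E[i:i + k] for i in range(len(char_ids) - k + 1)])  # (L-k+1, k, c)
    conv = np.einsum("wkc,kcd->wd", windows, filters)        # (L-k+1, d_char)
    return conv.max(axis=0)                                  # max-pool over positions

rng = np.random.default_rng(0)
char_emb = rng.normal(size=(30, 8))      # toy vocabulary of 30 characters
filters = rng.normal(size=(3, 8, 16))    # width-3 filters, 16 output channels
word_vec = char_cnn(np.array([4, 11, 2, 19, 7]), char_emb, filters)
print(word_vec.shape)
```

Because of the max-pooling, the output size depends only on the number of filters, not on the word length.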
2) Word Embedding Layer
- also map each word into high-dim vector space
- use pre-trained GloVe
Concatenate 1) & 2) \(\rightarrow\) passed to 2-layer Highway Networks
Output : 2 sequences of \(d\)-dim vectors ( = 2 matrices )
- 1) \(\mathbf{X} \in \mathbb{R}^{d \times T}\) : for the context
- 2) \(\mathbf{Q} \in \mathbb{R}^{d \times J}\) : for the query
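The concatenate-then-highway step can be sketched as follows. This assumes the usual highway layer form (a ReLU transform mixed with the input through a sigmoid gate); the function name `highway_layer`, the random weights, and the toy sizes are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway layer applied column-wise to x of shape (d, T)."""
    h = np.maximum(0.0, W_h @ x + b_h[:, None])   # candidate transform (ReLU)
    t = sigmoid(W_t @ x + b_t[:, None])           # transform gate in (0, 1)
    return t * h + (1.0 - t) * x                  # gate mixes transform and input

rng = np.random.default_rng(0)
d_char, d_word, T = 16, 24, 7                     # toy sizes (hypothetical)
d = d_char + d_word
char_vecs = rng.normal(size=(d_char, T))          # stand-in for CharCNN outputs
word_vecs = rng.normal(size=(d_word, T))          # stand-in for GloVe vectors

X = np.concatenate([char_vecs, word_vecs], axis=0)   # (d, T)
for _ in range(2):                                   # 2-layer highway network
    W_h = rng.normal(size=(d, d)) * 0.1
    W_t = rng.normal(size=(d, d)) * 0.1
    X = highway_layer(X, W_h, np.zeros(d), W_t, np.zeros(d))
print(X.shape)
```

The same layers would be applied to the query words to produce \(\mathbf{Q}\); the highway layer keeps the dimension \(d\) unchanged.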
3) Contextual Embedding Layer
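use biLSTM on top of the embeddings ( = the outputs \(\mathbf{X}\) and \(\mathbf{Q}\) of the previous layers )
( to model interactions between words )
obtain two matrices ( \(2d\) rows : forward & backward LSTM outputs are concatenated )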
- 1) \(\mathbf{H} \in \mathbb{R}^{2 d \times T}\) : from context word vectors \(\mathbf{X}\)
- 2) \(\mathbf{U} \in \mathbb{R}^{2 d \times J}\) : from query word vectors \(\mathbf{Q}\)
4) Attention Flow Layer
linking & fusing information from the “context” & “query” words
not used to summarize the query and context into single feature vectors
instead, flow through to the subsequent modeling layer
input : context \(\mathbf{H}\) and query \(\mathbf{U}\)
compute the attention in 2 directions
- 1) Context-to-Query ( C2Q )
- 2) Query-to-Context ( Q2C )
similarity matrix : \(\mathbf{S}_{t j}=\alpha\left(\mathbf{H}_{: t}, \mathbf{U}_{: j}\right) \in \mathbb{R}\).
\(\alpha\) : trainable scalar function that encodes the similarity between its two input vectors
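\(\rightarrow\) \(\alpha(\mathbf{h}, \mathbf{u})=\mathbf{w}_{(\mathbf{S})}^{\top}[\mathbf{h} ; \mathbf{u} ; \mathbf{h} \circ \mathbf{u}]\), with \(\mathbf{w}_{(\mathbf{S})} \in \mathbb{R}^{6 d}\) and \(\circ\) the elementwise product.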
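A NumPy sketch of the attention flow layer, under stated assumptions: the similarity is \(\alpha(\mathbf{h}, \mathbf{u})=\mathbf{w}_{(\mathbf{S})}^{\top}[\mathbf{h} ; \mathbf{u} ; \mathbf{h} \circ \mathbf{u}]\), and the C2Q / Q2C steps and the final concatenation \([\mathbf{h} ; \tilde{\mathbf{u}} ; \mathbf{h} \circ \tilde{\mathbf{u}} ; \mathbf{h} \circ \tilde{\mathbf{h}}]\) follow the form described in the BiDAF paper. The function name `attention_flow` and the toy sizes are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_flow(H, U, w_s):
    """H: (2d, T) context, U: (2d, J) query, w_s: (6d,) similarity weights."""
    n = H.shape[0]
    w_h, w_u, w_hu = w_s[:n], w_s[n:2 * n], w_s[2 * n:]
    # S[t, j] = w_s^T [h_t; u_j; h_t * u_j], split into its three terms
    S = (w_h @ H)[:, None] + (w_u @ U)[None, :] + (H * w_hu[:, None]).T @ U  # (T, J)

    # C2Q: for each context word, a softmax over the query words
    A = softmax(S, axis=1)                     # (T, J), rows sum to 1
    U_tilde = U @ A.T                          # (2d, T) attended query vectors

    # Q2C: which context words are most similar to some query word
    b = softmax(S.max(axis=1))                 # (T,)
    h_tilde = H @ b                            # (2d,)
    H_tilde = np.tile(h_tilde[:, None], (1, H.shape[1]))   # (2d, T)

    # Query-aware representation of every context word (no summarization)
    G = np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=0)  # (8d, T)
    return S, G

rng = np.random.default_rng(1)
H = rng.normal(size=(8, 5))      # 2d = 8, T = 5 (toy sizes)
U = rng.normal(size=(8, 3))      # J = 3
w_s = rng.normal(size=24)        # 6d = 24
S, G = attention_flow(H, U, w_s)
print(S.shape, G.shape)
```

Note that `G` still has one column per context word, which is the "no early summarization" property: attention is attached to every time step instead of being collapsed into one vector.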
5) Modeling Layer
- input : \(\mathbf{G}\) ( = output of the attention flow layer )
- use bi-LSTM
- encodes query-aware representations of context words
- captures the interaction among context words, conditioned on the query
- ( different from contextual embedding \(\rightarrow\) contextual embedding captures interaction among context words independent of the query)
- obtain \(\mathbf{M} \in \mathbb{R}^{2 d \times T}\)
6) Output Layer
- application specific ( in this paper : use it for QA task )
- QA-task
- requires the model to find a “sub-phrase” of the paragraph to answer question
- phrase is derived by predicting the “start” & “end” indices
- Starting index : \(\mathbf{p}^{1}=\operatorname{softmax}\left(\mathbf{w}_{\left(\mathbf{p}^{1}\right)}^{\top}[\mathbf{G} ; \mathbf{M}]\right)\).
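- End index : \(\mathbf{p}^{2}=\operatorname{softmax}\left(\mathbf{w}_{\left(\mathbf{p}^{2}\right)}^{\top}\left[\mathbf{G} ; \mathbf{M}^{2}\right]\right)\).
- ( \(\mathbf{M}^{2} \in \mathbb{R}^{2 d \times T}\) : obtained by passing \(\mathbf{M}\) through another bi-LSTM layer )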
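The two softmaxes above, plus one way to turn them into an answer span, can be sketched in NumPy. The helper names `span_probabilities` and `best_span` are hypothetical; `best_span` uses a simple \(O(T^2)\) search over all pairs with start \(\leq\) end, which is an illustrative choice (a linear-time dynamic program gives the same answer).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def span_probabilities(G, M, M2, w_p1, w_p2):
    """G: (8d, T), M and M2: (2d, T), w_p1 and w_p2: (10d,)."""
    p1 = softmax(w_p1 @ np.concatenate([G, M], axis=0))    # start distribution, (T,)
    p2 = softmax(w_p2 @ np.concatenate([G, M2], axis=0))   # end distribution, (T,)
    return p1, p2

def best_span(p1, p2):
    """Return (start, end) with start <= end that maximizes p1[start] * p2[end]."""
    joint = np.triu(np.outer(p1, p2))          # zero out pairs with end < start
    start, end = np.unravel_index(joint.argmax(), joint.shape)
    return int(start), int(end)

# Toy distributions: without the start <= end constraint, (1, 0) would win.
print(best_span([0.1, 0.7, 0.2], [0.6, 0.1, 0.3]))   # -> (1, 2)
```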
Loss Function : NLL
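- \(L(\theta)=-\frac{1}{N} \sum_{i}^{N}\left[\log \left(\mathbf{p}_{y_{i}^{1}}^{1}\right)+\log \left(\mathbf{p}_{y_{i}^{2}}^{2}\right)\right]\), where \(y_{i}^{1}\) and \(y_{i}^{2}\) are the true start and end indices of the \(i\)-th example.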
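As a worked example of the loss, the sketch below averages the negative log-probabilities of the true start and end indices over a batch. The function name `span_nll` is hypothetical, and it assumes all examples are padded to the same context length \(T\).

```python
import numpy as np

def span_nll(p1_batch, p2_batch, y1, y2):
    """p1_batch, p2_batch: (N, T) start / end distributions.
    y1, y2: (N,) integer arrays of true start / end indices."""
    n = np.arange(len(y1))
    # Pick the probability assigned to the true index in each example.
    return -np.mean(np.log(p1_batch[n, y1]) + np.log(p2_batch[n, y2]))

p1 = np.array([[0.5, 0.5], [0.25, 0.75]])
p2 = np.array([[0.5, 0.5], [0.5, 0.5]])
loss = span_nll(p1, p2, np.array([0, 1]), np.array([1, 0]))
print(loss)
```

The loss is zero only when both distributions put all their mass on the true indices, and it grows as either the start or the end prediction becomes less confident.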