7. Bi-Directional Attention Flow for Machine Comprehension (2017)
Table of Contents
- Abstract
- Introduction
- Model
- Character Embedding Layer
- Word Embedding Layer
- Contextual Embedding Layer
- Attention Flow Layer
- Modeling Layer
- Output Layer
Abstract
Machine Comprehension (MC)
- answering a query about a given context paragraph
- requires modeling the interaction between (1) the context & (2) the query
introduce the BiDAF ( Bi-Directional Attention Flow ) network
- a “multi-stage hierarchical” process
( that represents the context at “different levels of granularity” )
- uses a bi-directional attention flow mechanism to obtain a “query-aware context” representation
without early summarization
1. Introduction
tasks of MC ( Machine Comprehension ) & QA ( Question Answering )
- key factor of advancement : the NEURAL ATTENTION MECHANISM
\(\rightarrow\) focuses on a targeted area within a context paragraph
Attention mechanisms ( in previous works ) typically :
- 1) extract the most relevant information from the context for answering the question
- 2) are temporally dynamic
( = attention weights at the “current” time step are a function of those at the “previous” time step )
Introduce the BiDAF ( Bi-Directional Attention Flow ) network
- hierarchical : (1) character-level & (2) word-level & (3) contextual embeddings & (4) bi-directional attention flow
Attention layer
Property 1)
- NOT used to summarize the context paragraph ( into a fixed-size vector )
- INSTEAD, attention is computed at every time step & flows through to the subsequent layer
\(\rightarrow\) reduces information loss
Property 2)
- MEMORY-LESS attention mechanism
( = at each time step, attention is a function of only the query & the context at the current time step )
( & does not directly depend on the attention at the previous time step )
2. Model
1) Character Embedding Layer
2) Word Embedding Layer
3) Contextual Embedding Layer
4) Attention Flow Layer
5) Modeling Layer
6) Output Layer
1) Character Embedding Layer
- maps each word into a high-dim vector space
- words in the input context paragraph : \(\left\{x_{1}, \ldots x_{T}\right\}\)
words in the query : \(\left\{q_{1}, \ldots q_{J}\right\}\)
- uses a CharCNN ( character-level CNN, Kim 2014 ); a sketch follows below
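A minimal PyTorch sketch of a character-level CNN word embedding in the spirit of Kim (2014); the class name and hyperparameters ( char_dim, num_filters, kernel_size ) are illustrative assumptions, not the paper's exact settings:

```python
import torch
import torch.nn as nn

class CharCNNEmbedding(nn.Module):
    """Embed characters, convolve over each word, max-pool over time."""
    def __init__(self, num_chars, char_dim=16, num_filters=100, kernel_size=5):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim, padding_idx=0)
        # 1D convolution along the character axis of each word
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, char_ids):
        # char_ids: (batch, num_words, word_len) integer tensor
        B, T, W = char_ids.shape
        x = self.char_emb(char_ids.view(B * T, W))   # (B*T, W, char_dim)
        x = self.conv(x.transpose(1, 2))             # (B*T, num_filters, W)
        x = torch.relu(x).max(dim=2).values          # max-over-time pooling
        return x.view(B, T, -1)                      # (B, T, num_filters)
```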
2) Word Embedding Layer
- also maps each word into a high-dim vector space
- uses pre-trained GloVe word vectors
Concatenation of 1) & 2) \(\rightarrow\) passed to a 2-layer Highway Network ( sketched below, after the output shapes )
Output : 2 sequences of \(d\)-dim vectors
- 1) \(\mathbf{X} \in \mathbb{R}^{d \times T}\) : for the context
- 2) \(\mathbf{Q} \in \mathbb{R}^{d \times J}\) : for the query
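A minimal sketch of the 2-layer highway network ( Srivastava et al., 2015 ), assuming the concatenated char + word embedding has already been projected to dimension \(d\); gate bias initialization details are omitted:

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Each layer mixes a nonlinear transform of x with x itself via a learned gate."""
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.transforms = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.gates = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, x):
        for transform, gate in zip(self.transforms, self.gates):
            g = torch.sigmoid(gate(x))       # carry/transform gate in [0, 1]
            h = torch.relu(transform(x))     # candidate transform
            x = g * h + (1 - g) * x          # gated residual mix
        return x
```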
3) Contextual Embedding Layer
- uses a biLSTM on top of the embeddings from the previous layer
( to model temporal interactions between words )
- obtains two matrices ( \(2 d\) rows, since the forward & backward LSTM outputs are concatenated ) :
- 1) \(\mathbf{H} \in \mathbb{R}^{2 d \times T}\) : from context word vectors \(\mathbf{X}\)
- 2) \(\mathbf{U} \in \mathbb{R}^{2 d \times J}\) : from query word vectors \(\mathbf{Q}\)
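A minimal sketch of the contextual encoder; whether the context and query share LSTM weights is left as an implementation choice here ( the same module is simply applied to both \(\mathbf{X}\) and \(\mathbf{Q}\) ):

```python
import torch.nn as nn

class ContextualEmbedding(nn.Module):
    """biLSTM over d-dim embeddings; concatenated directions give 2d per step."""
    def __init__(self, d):
        super().__init__()
        self.lstm = nn.LSTM(d, d, batch_first=True, bidirectional=True)

    def forward(self, x):
        # x: (batch, seq_len, d) -> (batch, seq_len, 2d)
        out, _ = self.lstm(x)
        return out

# H = encoder(X)  # (batch, T, 2d);  U = encoder(Q)  # (batch, J, 2d)
```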
4) Attention Flow Layer
- links & fuses information from the “context” & “query” words
- NOT used to summarize the query and context into single feature vectors
instead, the attention ( together with the embeddings from previous layers ) flows through to the subsequent modeling layer
- input : context \(\mathbf{H}\) and query \(\mathbf{U}\)
- output : query-aware representations of the context words, \(\mathbf{G} \in \mathbb{R}^{8 d \times T}\)
- computes the attention in 2 directions
- 1) context-to-query ( C2Q )
- 2) query-to-context ( Q2C )
- both are derived from a shared similarity matrix : \(\mathbf{S}_{t j}=\alpha\left(\mathbf{H}_{: t}, \mathbf{U}_{: j}\right) \in \mathbb{R}\)
- \(\alpha\) : trainable scalar function that encodes the similarity between its two input vectors
\(\rightarrow\) \(\alpha(\mathbf{h}, \mathbf{u})=\mathbf{w}_{(\mathbf{S})}^{\top}[\mathbf{h} ; \mathbf{u} ; \mathbf{h} \circ \mathbf{u}]\), where \(\circ\) is element-wise multiplication & \([;]\) is vector concatenation
- C2Q : \(\tilde{\mathbf{U}}_{: t}=\sum_{j} a_{t j} \mathbf{U}_{: j}\), with \(\mathbf{a}_{t}=\operatorname{softmax}\left(\mathbf{S}_{t:}\right) \in \mathbb{R}^{J}\)
- Q2C : \(\tilde{\mathbf{h}}=\sum_{t} b_{t} \mathbf{H}_{: t}\), with \(\mathbf{b}=\operatorname{softmax}\left(\max _{c o l}(\mathbf{S})\right) \in \mathbb{R}^{T}\), tiled \(T\) times to \(\tilde{\mathbf{H}} \in \mathbb{R}^{2 d \times T}\)
- combined : \(\mathbf{G}_{: t}=\left[\mathbf{H}_{: t} ; \tilde{\mathbf{U}}_{: t} ; \mathbf{H}_{: t} \circ \tilde{\mathbf{U}}_{: t} ; \mathbf{H}_{: t} \circ \tilde{\mathbf{H}}_{: t}\right] \in \mathbb{R}^{8 d}\)
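A minimal PyTorch sketch of the attention flow computation under the definitions above ( single example, no batch dimension; padding masks omitted for brevity ):

```python
import torch

def attention_flow(H, U, w_s):
    """H: (2d, T) context, U: (2d, J) query, w_s: (6d,) trainable weights.
    Returns G: (8d, T), the query-aware context representation."""
    two_d, T = H.shape
    J = U.shape[1]
    # similarity matrix S[t, j] = w_s^T [h; u; h o u]
    Ht = H.t().unsqueeze(1).expand(T, J, two_d)       # (T, J, 2d)
    Uj = U.t().unsqueeze(0).expand(T, J, two_d)       # (T, J, 2d)
    S = torch.cat([Ht, Uj, Ht * Uj], dim=2) @ w_s     # (T, J)
    # C2Q: attended query vector for each context word
    a = torch.softmax(S, dim=1)                       # (T, J)
    U_tilde = U @ a.t()                               # (2d, T)
    # Q2C: one attended context vector, tiled over all T steps
    b = torch.softmax(S.max(dim=1).values, dim=0)     # (T,)
    h_tilde = (H @ b).unsqueeze(1).expand(two_d, T)   # (2d, T)
    # G combines the context with the attended vectors
    return torch.cat([H, U_tilde, H * U_tilde, H * h_tilde], dim=0)
```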
5) Modeling Layer
- input : \(\mathbf{G}\), the query-aware representations of the context words ( output of the attention flow layer )
- uses a 2-layer bi-LSTM
- captures the interaction among context words, conditioned on the query
- ( different from the contextual embedding layer \(\rightarrow\) that layer captures interaction among context words independent of the query )
- obtains \(\mathbf{M} \in \mathbb{R}^{2 d \times T}\) ( a sketch follows below )
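As a sketch, the modeling layer reduces to a stacked biLSTM over \(\mathbf{G}\) ( `d` below is a hypothetical hidden size, not necessarily the paper's setting ):

```python
import torch.nn as nn

d = 100  # hypothetical hidden size
# 2-layer biLSTM: G has 8d features per step, M has 2d features per step
modeling_lstm = nn.LSTM(input_size=8 * d, hidden_size=d, num_layers=2,
                        batch_first=True, bidirectional=True)
# M, _ = modeling_lstm(G)   # G: (batch, T, 8d) -> M: (batch, T, 2d)
```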
6) Output Layer
- application-specific ( in this paper : used for the QA task )
- QA task
- requires the model to find a “sub-phrase” of the paragraph that answers the question
- the phrase is derived by predicting the “start” & “end” indices ( sketched below )
- Start index : \(\mathbf{p}^{1}=\operatorname{softmax}\left(\mathbf{w}_{\left(\mathbf{p}^{1}\right)}^{\top}[\mathbf{G} ; \mathbf{M}]\right)\)
- End index : \(\mathbf{p}^{2}=\operatorname{softmax}\left(\mathbf{w}_{\left(\mathbf{p}^{2}\right)}^{\top}\left[\mathbf{G} ; \mathbf{M}^{2}\right]\right)\)
( where \(\mathbf{M}^{2}\) is obtained by passing \(\mathbf{M}\) through another biLSTM layer )
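A minimal sketch of the span prediction above ( single example; `end_lstm` is an assumed biLSTM module that turns \(\mathbf{M}\) into \(\mathbf{M}^{2}\) ):

```python
import torch

def predict_span(G, M, w_p1, w_p2, end_lstm):
    """G: (8d, T), M: (2d, T); w_p1, w_p2: (10d,) trainable weights.
    Returns start/end probability distributions over the T context positions."""
    p1 = torch.softmax(w_p1 @ torch.cat([G, M], dim=0), dim=0)    # (T,)
    M2, _ = end_lstm(M.t().unsqueeze(0))                          # (1, T, 2d)
    M2 = M2.squeeze(0).t()                                        # (2d, T)
    p2 = torch.softmax(w_p2 @ torch.cat([G, M2], dim=0), dim=0)   # (T,)
    return p1, p2
```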
Loss Function : NLL
- negative log-likelihood of the true start & end indices, averaged over all \(N\) examples :
- \(L(\theta)=-\frac{1}{N} \sum_{i}^{N}\left[\log \left(\mathbf{p}_{y_{i}^{1}}^{1}\right)+\log \left(\mathbf{p}_{y_{i}^{2}}^{2}\right)\right]\)
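The loss in code form, a minimal sketch assuming batched start/end distributions `p1`, `p2` of shape (N, T) and gold indices `y1`, `y2` of shape (N,):

```python
import torch

def bidaf_loss(p1, p2, y1, y2):
    """Mean NLL of the true start (y1) and end (y2) indices."""
    rows = torch.arange(p1.shape[0])
    # pick each example's probability at its gold index, then average over N
    return -(torch.log(p1[rows, y1]) + torch.log(p2[rows, y2])).mean()
```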