7. Bi-Directional Attention Flow for Machine Comprehension (2017)

Contents

  1. Abstract
  2. Introduction
  3. Model
    1. Character Embedding Layer
    2. Word Embedding Layer
    3. Contextual Embedding Layer
    4. Attention Flow Layer
    5. Modeling Layer
    6. Output Layer


Abstract

Machine Comprehension (MC)

  • answering a query!
  • models interaction between (1) context & (2) query


introduce BiDAF ( Bi-Directional Attention Flow ) network

  • “multi-stage hierarchical” process

    ( that represents the context at “different levels of granularity” )

  • uses bi-directional attention flow mechanism to obtain a “query-aware context” representation

    without early summarization


1. Introduction

task of MC(Machine Comprehension) & QA(Question Answering)

  • key factors of advancement : NEURAL ATTENTION MECHANISM

    \(\rightarrow\) focus on a targeted area within a context paragraph


Attention mechanism

  • 1) extract the most relevant information from the context, for answering the question

  • 2) temporally dynamic!

    ( = attention weights at the “current time” step are a function of those at the “previous time” step )


Introduce BiDAF ( Bi-Directional Attention Flow ) network

  • hierarchical : (1) character-level & (2) word-level & (3) contextual embeddings & (4) biDAF

  • attention layer

    Property 1)

    • NOT used to summarize the context paragraph ( into a fixed-size vector )

    • INSTEAD, attention is computed at every time step & flows through to the subsequent layers

      \(\rightarrow\) reduces information loss

    Property 2)

    • MEMORY-LESS attention mechanism

      ( = attention is still computed iteratively across time steps )

      ( but the attention at each time step does not directly depend on the attention at the previous time step )


2. Model

1) Character Embedding Layer

2) Word Embedding Layer

3) Contextual Embedding Layer

4) Attention Flow Layer

5) Modeling Layer

6) Output Layer


1) Character Embedding Layer

  • map each word into high-dim vector space

  • words in input context paragraph : \(\left\{x_{1}, \ldots x_{T}\right\}\).

    query : \(\left\{q_{1}, \ldots q_{J}\right\}\).

  • use CharCNN

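The paper obtains character-level word embeddings with a CNN over characters (CharCNN). A minimal sketch, assuming a PyTorch setup; the character dimension, number of filters, and kernel width are illustrative placeholders, not the paper’s exact hyperparameters.

```python
import torch
import torch.nn as nn

class CharEmbedding(nn.Module):
    """Character-level word embedding: 1D CNN over characters + max-over-time pooling.
    Hyperparameters (char_dim, num_filters, kernel_width) are illustrative."""
    def __init__(self, num_chars, char_dim=16, num_filters=100, kernel_width=5):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_width, padding=kernel_width // 2)

    def forward(self, char_ids):                    # (batch, T, word_len) character ids
        B, T, W = char_ids.shape
        x = self.char_emb(char_ids)                 # (B, T, W, char_dim)
        x = x.view(B * T, W, -1).transpose(1, 2)    # (B*T, char_dim, W) for Conv1d
        x = torch.relu(self.conv(x))                # (B*T, num_filters, W)
        x = x.max(dim=2).values                     # max-pool over character positions
        return x.view(B, T, -1)                     # (B, T, num_filters): one vector per word
```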

2) Word Embedding Layer

  • also map each word into high-dim vector space
  • use pre-trained GloVe


Concatenate 1) & 2) \(\rightarrow\) passed to a 2-layer Highway Network

Output : 2 sequences of \(d\)-dim vectors ( i.e. 2 matrices )

  • 1) \(\mathbf{X} \in \mathbb{R}^{d \times T}\) : for the context
  • 2) \(\mathbf{Q} \in \mathbb{R}^{d \times J}\) : for the query

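A sketch of how the concatenated character & word embeddings might be passed through a 2-layer highway network (Srivastava et al.) to produce \(\mathbf{X}\) and \(\mathbf{Q}\). Shapes are batch-first for readability, so \(\mathbf{X}\) appears as (batch, T, d) rather than \(d \times T\).

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """2-layer highway network: per layer, y = g * H(x) + (1 - g) * x."""
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.transforms = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.gates = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, x):                            # (batch, seq_len, dim)
        for transform, gate in zip(self.transforms, self.gates):
            g = torch.sigmoid(gate(x))               # transform gate in [0, 1]
            x = g * torch.relu(transform(x)) + (1 - g) * x
        return x

# usage sketch (d = char_emb_dim + word_emb_dim):
# X = highway(torch.cat([char_emb_ctx,   word_emb_ctx],   dim=-1))   # (batch, T, d): context
# Q = highway(torch.cat([char_emb_query, word_emb_query], dim=-1))   # (batch, J, d): query
```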

3) Contextual Embedding Layer

  • use biLSTM on top of the embeddings from the previous layers ( i.e. \(\mathbf{X}\) & \(\mathbf{Q}\) )

    ( to model interactions between words )

  • obtain two matrices

    • 1) \(\mathbf{H} \in \mathbb{R}^{2 d \times T}\) : from context word vectors \(\mathbf{X}\)
    • 2) \(\mathbf{U} \in \mathbb{R}^{2 d \times J}\) : from query word vectors \(\mathbf{Q}\)

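A sketch of the contextual embedding step, assuming PyTorch’s nn.LSTM with bidirectional=True; concatenating the forward & backward hidden states is what yields the \(2d\)-dim \(\mathbf{H}\) and \(\mathbf{U}\).

```python
import torch.nn as nn

class ContextualEmbedding(nn.Module):
    """Bidirectional LSTM over the d-dim embeddings; outputs are 2d-dim."""
    def __init__(self, d):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=d, hidden_size=d,
                              batch_first=True, bidirectional=True)

    def forward(self, x):            # x: (batch, len, d) -- context X or query Q
        h, _ = self.bilstm(x)        # (batch, len, 2d): forward & backward states concatenated
        return h

# H = contextual(X)   # (batch, T, 2d) for the context
# U = contextual(Q)   # (batch, J, 2d) for the query (reusing the same module is a simplification)
```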

4) Attention Flow Layer

  • linking & fusing information from the “context” & “query” words

  • NOT used to summarize the query & context into single feature vectors

    instead, the attention vector at each time step ( along with the embeddings from previous layers ) flows through to the subsequent modeling layer

  • input : context \(\mathbf{H}\) and query \(\mathbf{U}\)

  • compute the attention in 2 directions

    • 1) context-to-query ( C2Q )
    • 2) query-to-context ( Q2C )
  • similarity matrix : \(\mathbf{S}_{t j}=\alpha\left(\mathbf{H}_{: t}, \mathbf{U}_{: j}\right) \in \mathbb{R}\).

    • \(\alpha\) : trainable scalar function that encodes the similarity between its two input vectors

      \(\rightarrow\) \(\alpha(\mathbf{h}, \mathbf{u})=\mathbf{w}_{(\mathbf{S})}^{\top}[\mathbf{h} ; \mathbf{u} ; \mathbf{h} \circ \mathbf{u}]\)

      ( \(\circ\) : element-wise multiplication, \([ ; ]\) : vector concatenation, \(\mathbf{w}_{(\mathbf{S})} \in \mathbb{R}^{6 d}\) : trainable weight vector )

  • \(\mathbf{S}\) is shared by both attention directions : C2Q gives the attended query vectors \(\tilde{\mathbf{U}} \in \mathbb{R}^{2 d \times T}\), Q2C gives the attended context vectors \(\tilde{\mathbf{H}} \in \mathbb{R}^{2 d \times T}\)

  • output : query-aware context representation \(\mathbf{G} \in \mathbb{R}^{8 d \times T}\), with \(\mathbf{G}_{: t}=\left[\mathbf{H}_{: t} ; \tilde{\mathbf{U}}_{: t} ; \mathbf{H}_{: t} \circ \tilde{\mathbf{U}}_{: t} ; \mathbf{H}_{: t} \circ \tilde{\mathbf{H}}_{: t}\right]\) ( see the sketch below )

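A sketch of the attention-flow computation under the definitions above: the shared similarity matrix \(\mathbf{S}\), C2Q attention ( softmax over query positions ), Q2C attention ( softmax over the column-wise max of \(\mathbf{S}\) ), and the merged output \(\mathbf{G}\). Shapes are batch-first for readability, so \(\mathbf{H}\) appears as (batch, T, 2d) rather than \(2d \times T\).

```python
import torch
import torch.nn as nn

class AttentionFlow(nn.Module):
    """Bi-directional attention flow (batch-first shapes for readability)."""
    def __init__(self, d):
        super().__init__()
        # trainable weight for alpha(h, u) = w^T [h; u; h*u]; each part is 2d-dim
        self.w_s = nn.Linear(6 * d, 1, bias=False)

    def forward(self, H, U):                         # H: (B, T, 2d), U: (B, J, 2d)
        B, T, _ = H.shape
        J = U.size(1)
        H_exp = H.unsqueeze(2).expand(B, T, J, -1)   # (B, T, J, 2d)
        U_exp = U.unsqueeze(1).expand(B, T, J, -1)   # (B, T, J, 2d)
        S = self.w_s(torch.cat([H_exp, U_exp, H_exp * U_exp], dim=-1)).squeeze(-1)  # (B, T, J)

        # context-to-query: attend over query words for each context word
        a = torch.softmax(S, dim=2)                  # (B, T, J)
        U_tilde = torch.bmm(a, U)                    # (B, T, 2d)

        # query-to-context: attend over context words via the column-wise max of S
        b = torch.softmax(S.max(dim=2).values, dim=1)    # (B, T)
        h_tilde = torch.bmm(b.unsqueeze(1), H)           # (B, 1, 2d)
        H_tilde = h_tilde.expand(B, T, -1)               # tile across T context positions

        # G = [H; U~; H*U~; H*H~]  ->  (B, T, 8d)
        return torch.cat([H, U_tilde, H * U_tilde, H * H_tilde], dim=-1)
```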

5) Modeling Layer ( input : \(\mathbf{G}\) )

  • use bi-LSTM ( 2 layers, see the sketch below )
  • encodes query-aware representations of context words
  • captures the interaction among context words, conditioned on the query
  • ( different from contextual embedding \(\rightarrow\) contextual embedding captures interaction among context words independent of the query)
  • obtain \(\mathbf{M} \in \mathbb{R}^{2 d \times T}\)

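A minimal sketch of the modeling layer, assuming ( as in the paper ) two layers of bidirectional LSTM applied to \(\mathbf{G}\); the hidden size d is a placeholder value here.

```python
import torch.nn as nn

d = 100   # illustrative hidden size (placeholder)

# two-layer biLSTM over the query-aware context G: (batch, T, 8d) -> M: (batch, T, 2d)
modeling_lstm = nn.LSTM(input_size=8 * d, hidden_size=d, num_layers=2,
                        batch_first=True, bidirectional=True)
# M, _ = modeling_lstm(G)
```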

6) Output Layer

  • application-specific ( in this paper : used for the QA task )
  • QA task
    • requires the model to find a “sub-phrase” of the paragraph to answer the question
    • the phrase is derived by predicting the “start” & “end” indices
  • Starting index : \(\mathbf{p}^{1}=\operatorname{softmax}\left(\mathbf{w}_{\left(\mathbf{p}^{1}\right)}^{\top}[\mathbf{G} ; \mathbf{M}]\right)\).
  • End index : \(\mathbf{p}^{2}=\operatorname{softmax}\left(\mathbf{w}_{\left(\mathbf{p}^{2}\right)}^{\top}\left[\mathbf{G} ; \mathbf{M}^{2}\right]\right)\).

    ( \(\mathbf{M}^{2} \in \mathbb{R}^{2 d \times T}\) : obtained by passing \(\mathbf{M}\) through another bi-LSTM )

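A sketch of the QA output layer following the two formulas above; \(\mathbf{M}^{2}\) is produced by an extra biLSTM over \(\mathbf{M}\), and shapes are batch-first, so \([\mathbf{G} ; \mathbf{M}]\) becomes a concatenation along the last (feature) dimension.

```python
import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    """Start/end span prediction over context positions (shapes batch-first)."""
    def __init__(self, d):
        super().__init__()
        self.w_p1 = nn.Linear(10 * d, 1, bias=False)    # [G; M]  has 8d + 2d = 10d features
        self.w_p2 = nn.Linear(10 * d, 1, bias=False)    # [G; M2] also has 10d features
        self.end_lstm = nn.LSTM(2 * d, d, batch_first=True, bidirectional=True)

    def forward(self, G, M):                            # G: (B, T, 8d), M: (B, T, 2d)
        p1 = torch.softmax(self.w_p1(torch.cat([G, M], dim=-1)).squeeze(-1), dim=1)
        M2, _ = self.end_lstm(M)                        # another biLSTM over M -> (B, T, 2d)
        p2 = torch.softmax(self.w_p2(torch.cat([G, M2], dim=-1)).squeeze(-1), dim=1)
        return p1, p2                                   # start & end distributions, each (B, T)
```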

Loss Function : NLL

  • \(L(\theta)=-\frac{1}{N} \sum_{i}^{N}\left[\log \left(\mathbf{p}_{y_{i}^{1}}^{1}\right)+\log \left(\mathbf{p}_{y_{i}^{2}}^{2}\right)\right]\).

    ( \(y_{i}^{1}\), \(y_{i}^{2}\) : true start & end indices of the \(i\)-th example )
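
A short sketch of the loss, matching the formula above: the mean of the negative log-probabilities assigned to the true start & end indices ( p1, p2 come from the output-layer sketch; eps is just for numerical safety ).

```python
import torch

def bidaf_nll(p1, p2, y1, y2, eps=1e-12):
    """Mean NLL of the true start (y1) and end (y2) indices.
    p1, p2: (B, T) probability distributions; y1, y2: (B,) long tensors of gold indices."""
    nll_start = -torch.log(p1.gather(1, y1.unsqueeze(1)).squeeze(1) + eps)
    nll_end = -torch.log(p2.gather(1, y2.unsqueeze(1)).squeeze(1) + eps)
    return (nll_start + nll_end).mean()
```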