Context-Aware Self-Attention Networks (2019)

Contents

  0. Abstract

  1. Introduction

  2. Background ( SANs )

    2-1) Conventional SANs

    2-2) Motivation
    2-3) Approach


0. Abstract

Characteristics / advantages of self-attention

  • flexibility in parallel computation
  • captures both long- and short-term dependencies

\(\rightarrow\) but it does NOT capture contextual information


Therefore, this paper improves the self-attention network (SAN)!

( by capturing the CONTEXTUAL information )


HOW? contextualize the transformations of the QUERY and KEY layers


1. Introduction

SANs ( Self-attention networks )

  • Lin et al. 2017

  • Advantage : can also capture long-range dependencies

  • Problem : treats the input sequence as bag-of-word tokens &

    each token performs attention INDIVIDUALLY

    \(\rightarrow\) contextual information is NOT reflected


Propose to strengthen SANs through capturing the richness of context


2. Background ( SANs )

2-1) Conventional SANs

“compute the attention weight between each pair of tokens”

  • Input Layer : \(\mathbf{H}=\left\{h_{1}, \ldots, h_{n}\right\}\)

  • Q, K, V : obtained by multiplying \(\mathbf{H}\) by weight matrices …. \(\left[\begin{array}{c} \mathbf{Q} \\ \mathbf{K} \\ \mathbf{V} \end{array}\right]=\mathbf{H}\left[\begin{array}{l} \mathbf{W}_{Q} \\ \mathbf{W}_{K} \\ \mathbf{W}_{V} \end{array}\right]\)
  • Output Layer : \(\mathrm{O}=\operatorname{ATT}(\mathbf{Q}, \mathbf{K}) \mathbf{V}\)

  • Two ways to compute the attention
    • 1) additive
    • 2) dot-product \(\rightarrow\) this one is used! \(\operatorname{ATT}(\mathbf{Q}, \mathbf{K})=\operatorname{softmax}\left(\frac{\mathbf{Q K}^{T}}{\sqrt{d}}\right)\) ( see the sketch below )
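
A minimal PyTorch sketch of this conventional SAN computation ( the sizes n, d and the random weights are placeholders for illustration, not values from the paper ) :

```python
import torch
import torch.nn.functional as F

def self_attention(H, W_Q, W_K, W_V):
    """Conventional SAN: O = ATT(Q, K) V with scaled dot-product attention."""
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V                            # each: (n, d)
    d = Q.size(-1)
    att = F.softmax(Q @ K.transpose(-2, -1) / d ** 0.5, dim=-1)    # (n, n)
    return att @ V                                                 # (n, d)

# toy usage: n = 5 tokens with hidden size d = 8 (numbers are arbitrary)
n, d = 5, 8
H = torch.randn(n, d)
W_Q, W_K, W_V = (torch.randn(d, d) for _ in range(3))
O = self_attention(H, W_Q, W_K, W_V)
print(O.shape)   # torch.Size([5, 8])
```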


2-2) Motivation

When computing the similarity between \(Q\) and \(K\)… only 2 parameter matrices are involved ( checked numerically below )

  • \(\mathbf{Q K}^{T}=\left(\mathbf{H W}_{Q}\right)\left(\mathbf{H W}_{K}\right)^{T}=\mathbf{H}\left(\mathbf{W}_{Q} \mathbf{W}_{K}^{T}\right) \mathbf{H}^{T}\).
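
The identity above can be checked numerically; a quick sketch ( random H, W_Q, W_K are placeholders ). It makes the motivation concrete: besides \(\mathbf{H}\) itself, only the two matrices \(\mathbf{W}_{Q}, \mathbf{W}_{K}\) enter the similarity, so no additional contextual information is modeled.

```python
import torch

n, d = 5, 8
H = torch.randn(n, d)
W_Q, W_K = torch.randn(d, d), torch.randn(d, d)

# similarity computed the usual way: (H W_Q)(H W_K)^T ...
left = (H @ W_Q) @ (H @ W_K).T
# ... equals H (W_Q W_K^T) H^T: only H and the two parameter matrices appear
right = H @ (W_Q @ W_K.T) @ H.T

print(torch.allclose(left, right, atol=1e-4))   # True
```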


2-3) Approach

Several kinds of context vectors are proposed

( + how to incorporate them into the SAN )


(a) Context-Aware Self-Attention Model

propose to contextualize the transformations from the input layer \(\mathbf{H}\) to Q & K

Borrows the method of Shaw et al. (2018) for propagating contextual information

  • \(\left[\begin{array}{c} \widehat{\mathbf{Q}} \\ \widehat{\mathbf{K}} \end{array}\right]=\left(1-\left[\begin{array}{c} \lambda_{Q} \\ \lambda_{K} \end{array}\right]\right)\left[\begin{array}{l} \mathbf{Q} \\ \mathbf{K} \end{array}\right]+\left[\begin{array}{c} \lambda_{Q} \\ \lambda_{K} \end{array}\right]\left(\mathbf{C}\left[\begin{array}{c} \mathbf{U}_{Q} \\ \mathbf{U}_{K} \end{array}\right]\right)\).

    where \(\mathbf{C} \in \mathbb{R}^{n \times d_{c}}\) is the context vector

    where \(\left\{\mathbf{U}_{Q}, \mathbf{U}_{K}\right\} \in \mathbb{R}^{d_{c} \times d}\) are trainable parameters


\(\left\{\lambda_{Q}, \lambda_{K}\right\}\) : can also be treated as factors to regulate the magnitude of \(\widehat{Q}\) and \(\widehat{K}\)


The paper proposes to compute them as follows:

\(\left[\begin{array}{l} \lambda_{Q} \\ \lambda_{K} \end{array}\right]=\sigma\left(\left[\begin{array}{l} \mathbf{Q} \\ \mathbf{K} \end{array}\right]\left[\begin{array}{c} \mathbf{V}_{Q}^{H} \\ \mathbf{V}_{K}^{H} \end{array}\right]+\mathbf{C}\left[\begin{array}{c} \mathbf{U}_{Q} \\ \mathbf{U}_{K} \end{array}\right]\left[\begin{array}{c} \mathbf{V}_{Q}^{C} \\ \mathbf{V}_{K}^{C} \end{array}\right]\right)\).

where \(\left\{\mathbf{V}_{Q}^{H}, \mathbf{V}_{K}^{H}\right\} \in \mathbb{R}^{d \times 1}\) and \(\left\{\mathbf{V}_{Q}^{C}, \mathbf{V}_{K}^{C}\right\} \in \mathbb{R}^{d_{c} \times 1}\) are trainable parameters


Role of these gating scalars

  • enables the model to explicitly quantify how much each representation and the context vector contribute to the prediction of the attention weight ( sketched in code below )
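
Putting part (a) together, a hedged sketch with a global context. All variable names are mine, and the shapes of \(\mathbf{V}_{Q}^{C}, \mathbf{V}_{K}^{C}\) are taken as \((d, 1)\) here purely so that the product \(\mathbf{C}\mathbf{U}_{Q}\mathbf{V}_{Q}^{C}\) is well-defined; this is a sketch of the idea, not the paper's exact implementation.

```python
import torch

def context_aware_qk(H, C, W_Q, W_K, U_Q, U_K, V_Q_H, V_K_H, V_Q_C, V_K_C):
    """Contextualize the query/key transformations with a context matrix C."""
    Q, K = H @ W_Q, H @ W_K                         # (n, d)
    CQ, CK = C @ U_Q, C @ U_K                       # context projected to (n, d)
    # gating scalars: one scalar per token, for Q and for K separately
    lam_Q = torch.sigmoid(Q @ V_Q_H + CQ @ V_Q_C)   # (n, 1)
    lam_K = torch.sigmoid(K @ V_K_H + CK @ V_K_C)   # (n, 1)
    # convex combination of the original representations and the context
    Q_hat = (1 - lam_Q) * Q + lam_Q * CQ
    K_hat = (1 - lam_K) * K + lam_K * CK
    return Q_hat, K_hat

# toy usage with a GLOBAL context: mean of H broadcast to every position
n, d, d_c = 5, 8, 8
H = torch.randn(n, d)
C = H.mean(dim=0, keepdim=True).expand(n, d_c)      # (n, d_c), here d_c = d
W_Q, W_K = torch.randn(d, d), torch.randn(d, d)
U_Q, U_K = torch.randn(d_c, d), torch.randn(d_c, d)
V_Q_H, V_K_H = torch.randn(d, 1), torch.randn(d, 1)
V_Q_C, V_K_C = torch.randn(d, 1), torch.randn(d, 1)

Q_hat, K_hat = context_aware_qk(H, C, W_Q, W_K, U_Q, U_K,
                                V_Q_H, V_K_H, V_Q_C, V_K_C)
print(Q_hat.shape, K_hat.shape)   # torch.Size([5, 8]) torch.Size([5, 8])
```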


(b) Choices of Context Vectors

  • Global Context : \(\mathbf{c}=\overline{\mathbf{H}} \quad \in \mathbb{R}^{d}\)
  • Deep Context : \(\mathbf{C}=\left[\mathbf{H}^{1}, \ldots, \mathbf{H}^{l-1}\right] \quad \in \mathbb{R}^{n \times(l-1) d}\)
  • Deep-Global Context : \(\mathbf{c}=\left[\mathbf{c}^{1}, \ldots, \mathbf{c}^{l}\right] \quad \in \mathbb{R}^{l d}\)
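
A rough sketch of how these three context vectors could be constructed from the per-layer representations \(\mathbf{H}^{1}, \ldots, \mathbf{H}^{l}\) ( the list `Hs` and the mean-pooling used for the bar notation are my assumptions for illustration ) :

```python
import torch

n, d, l = 5, 8, 4                            # tokens, hidden size, current layer
Hs = [torch.randn(n, d) for _ in range(l)]   # H^1, ..., H^l

# Global context: mean-pool the current layer, one vector shared by all tokens
c_global = Hs[-1].mean(dim=0)                                     # (d,)

# Deep context: concatenate the l-1 lower-layer representations per token
C_deep = torch.cat(Hs[:-1], dim=-1)                               # (n, (l-1)*d)

# Deep-global context: concatenate the global (mean-pooled) vectors of all l layers
c_deep_global = torch.cat([h.mean(dim=0) for h in Hs], dim=-1)    # (l*d,)

print(c_global.shape, C_deep.shape, c_deep_global.shape)
```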


The figure below makes this easy to understand!

( Figure 2 )