Uncertainty-Aware Attention for Reliable Interpretation and Prediction

Contents

  0. Abstract
  1. Introduction
  2. Approach
    2-1. Stochastic Attention with input-adaptive Gaussian Noise
    2-2. Variational Inference


0. Abstract

The attention mechanism is powerful in that it lets a model focus on the relevant features, but…

it may be UNRELIABLE


To overcome this, the paper introduces the notion of input-dependent uncertainty

( = generates attention for each feature with varying degrees of noise based on the given input )

learn larger variance on instances with high uncertainty


propose the UA ( Uncertainty-aware Attention ) mechanism, trained with VI ( Variational Inference )


1. Introduction

Background

  • Obtaining high reliability is crucial, especially in safety-related applications!

  • Attention : find most relevant features for each input instance

    ( + allows easy interpretation )

  • Limitation of attention : it can be unreliable

    need a model that knows its own limitation!

    ( = the model itself should know whether it is safe enough to make a prediction / decision! )


Proposal

allow attention to output uncertainty on each input

  • furthermore, leverage this uncertainty when making the final prediction

  • specifically, model the attention weights as Gaussians with input-dependent noise


Contribution

  • 1) propose a variational attention model
  • 2) show that the UA mechanism yields accurate calibration of model uncertainty
  • 3) validate it on six real-world datasets


2. Approach

STOCHASTIC attention ( not first proposed in this paper )

  • \mathbf{v}(\mathbf{x}) \in \mathbb{R}^{r \times i} : concatenation of i intermediate features
    • each column \mathbf{v}_{j}(\mathbf{x}) is a length-r vector
    • the random variables \{a_{j}\}_{j=1}^{i} are generated conditionally from \mathbf{v}(\mathbf{x})
  • \mathbf{c}(\mathbf{x})=\sum_{j=1}^{i} a_{j} \mathbf{v}_{j}(\mathbf{x}) : context vector ( \mathbf{c} \in \mathbb{R}^{r} )
  • \hat{\mathbf{y}}=f(\mathbf{c}(\mathbf{x})) : final output ( a small code sketch follows below )
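To make the notation concrete, here is a minimal PyTorch-style sketch of the context-vector computation with deterministic soft attention; the (batch, i, r) tensor layout and all names are my own, not the paper's.

```python
import torch

# v: (batch, i, r) -- i intermediate feature vectors v_j(x), each of length r
# a: (batch, i)    -- one attention weight per feature
def context_vector(v: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    # c(x) = sum_j a_j * v_j(x) : a length-r context vector per instance
    return torch.einsum('bir,bi->br', v, a)

v = torch.randn(4, 10, 16)                       # batch 4, i = 10 features, r = 16
a = torch.softmax(torch.randn(4, 10), dim=-1)    # deterministic soft attention weights
c = context_vector(v, a)                         # shape (4, 16); fed into f(.) for y_hat
```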


attention can be deterministic or stochastic

  • ex) stochastic attention : each a_{j} is drawn from a Bernoulli distribution.

    maximize ELBO

  • stochastic attention outperforms its deterministic counterpart on an image annotation task ( a sketch follows below )
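For contrast with the Gaussian version introduced next, here is a rough sketch of this vanilla Bernoulli (hard) stochastic attention; the function name and tensor layout are illustrative, not the paper's code.

```python
import torch

def bernoulli_attention(v: torch.Tensor, mu: torch.Tensor) -> torch.Tensor:
    """v: (batch, i, r) features, mu: (batch, i) allocation probabilities in [0, 1]."""
    a = torch.bernoulli(mu)                    # a_j ~ Bernoulli(mu_j): attend (1) or not (0)
    return torch.einsum('bir,bi->br', v, a)    # stochastic (hard) context vector

# Var[a_j] = mu_j * (1 - mu_j) is fully determined by mu_j, and the hard sampling
# is not differentiable, so training maximizes an ELBO instead of a plain likelihood.
```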


2-1. Stochastic Attention with input-adaptive Gaussian Noise

Two limitations of the stochastic attention above

( which draws the stochastic attention directly from a Bernoulli / Multinomial distribution )

  • [ Limitation 1 ] the variance of a Bernoulli is DEPENDENT on the allocation probability \mu

    • a \sim \operatorname{Bernoulli}(\mu).
    • allocation probability \mu : whether or not to attend to the feature
    • variance of the Bernoulli : \sigma^{2}=\mu(1-\mu)
    • therefore, the variance of a cannot be low when \mu is near 0.5 ( e.g., \mu=0.5 gives the maximum \sigma^{2}=0.25 )
  • [ Solution 1 ]

    disentangle the attention strength a from the attention uncertainty

    so that the uncertainty can vary even for the same attention strength


  • [ Limitation 2 ] vanilla stochastic attention models the noise independently of the input

    • ( previous approach ) the noise is modeled without any dependence on the input
  • [ Solution 2 ] to overcome both limitations above…

    "model the noise as a function of the input ( \boldsymbol{\sigma}(\mathbf{x}) )"

    • ( previous ) p(\boldsymbol{\omega})=\mathcal{N}\left(\mathbf{0}, \tau^{-1} \mathbf{I}\right), \quad p_{\theta}(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\omega})=\mathcal{N}\left(\boldsymbol{\mu}(\mathbf{x}, \boldsymbol{\omega} ; \theta), \operatorname{diag}\left(\boldsymbol{\sigma}^{2}\right)\right)

    • ( proposed ) p(\boldsymbol{\omega})=\mathcal{N}\left(\mathbf{0}, \tau^{-1} \mathbf{I}\right), \quad p_{\theta}(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\omega})=\mathcal{N}\left(\boldsymbol{\mu}(\mathbf{x}, \boldsymbol{\omega} ; \theta), \operatorname{diag}\left(\boldsymbol{\sigma}^{2}(\mathbf{x}, \boldsymbol{\omega} ; \theta)\right)\right)

      ( in both equations, \mathbf{z} is the attention score before squashing, i.e., \mathbf{a}=\pi(\mathbf{z}) )

    • it is empirically shown that the quality of the uncertainty estimates improves ( see the code sketch below )
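A minimal PyTorch sketch of this input-adaptive Gaussian attention, assuming one linear head each for \mu(\mathbf{x}) and \sigma(\mathbf{x}) and a sigmoid squashing \pi; the class and layer names are my own, not the paper's.

```python
import torch
import torch.nn as nn

class UncertaintyAwareAttention(nn.Module):
    """Sketch: z ~ N(mu(x), diag(sigma^2(x))), then a = pi(z) with a sigmoid squashing."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.mu_layer = nn.Linear(feat_dim, 1)        # mean of the pre-squash score z_j
        self.logvar_layer = nn.Linear(feat_dim, 1)    # input-dependent log-variance

    def forward(self, v: torch.Tensor):
        # v: (batch, i, feat_dim) -- one row per intermediate feature v_j(x)
        mu = self.mu_layer(v).squeeze(-1)                          # (batch, i)
        sigma = torch.exp(0.5 * self.logvar_layer(v)).squeeze(-1)  # sigma(x) > 0
        z = mu + sigma * torch.randn_like(mu)                      # reparameterized sample of z
        a = torch.sigmoid(z)                                       # squashing a = pi(z)
        c = torch.einsum('bif,bi->bf', v, a)                       # context c(x) = sum_j a_j v_j(x)
        return c, a
```

Because \mu and \sigma come from separate heads, the attention strength and its uncertainty are disentangled, and the noise level depends on the input.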


2-2. Variational Inference

( the model built in 2-1 above is now trained with VI )

\mathbf{Z} : set of latent variables \left\{\mathbf{z}^{(n)}\right\}_{n=1}^{N} that stand for the attention weights before squashing.

Posterior p(\mathbf{Z}, \boldsymbol{\omega} \mid \mathcal{D}) is usually computationally intractable !

\rightarrow use VI ( Variational Inference )


Variational Distribution :

  • q(\mathbf{Z}, \boldsymbol{\omega} \mid \mathcal{D})=q_{\mathbf{M}}(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y}) q(\mathbf{Z} \mid \mathbf{X}, \mathbf{Y}, \boldsymbol{\omega}).

  • 1st term ) use MC Dropout ( with variational parameters \mathbf{M} )
  • 2nd term ) simply set q(\mathbf{Z} \mid \mathbf{X}, \mathbf{Y}, \boldsymbol{\omega})=p_{\theta}(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\omega}), which makes the corresponding KL term in the ELBO below vanish


Putting this together, training is equivalent to maximizing the following ELBO.

  • \begin{aligned} \log p(\mathbf{Y} \mid \mathbf{X}) \geq & \mathbb{E}_{\boldsymbol{\omega} \sim q_{\mathbf{M}}(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y}), \mathbf{Z} \sim p_{\theta}(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\omega})}[\log p(\mathbf{Y} \mid \mathbf{X}, \mathbf{Z}, \boldsymbol{\omega})] \\ &-\mathrm{KL}\left[q_{\mathbf{M}}(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y}) \| p(\boldsymbol{\omega})\right]-\operatorname{KL}\left[q(\mathbf{Z} \mid \mathbf{X}, \mathbf{Y}, \boldsymbol{\omega}) \| p_{\theta}(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\omega})\right] \end{aligned}.


Final objective to maximize : \mathcal{L}(\theta, \mathbf{M} ; \mathbf{X}, \mathbf{Y})=\sum_{n=1}^{N} \log p_{\theta}\left(\mathbf{y}^{(n)} \mid \tilde{\mathbf{z}}^{(n)}, \mathbf{x}^{(n)}\right)-\lambda\|\mathbf{M}\|^{2}

( under MC Dropout, the KL term for \boldsymbol{\omega} is approximated by the weight-decay term \lambda\|\mathbf{M}\|^{2} )

  • step 1 ) sample random weights with dropout masks, \widetilde{\boldsymbol{\omega}} \sim q_{\mathbf{M}}(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})
  • step 2 ) sample \mathbf{z} via the reparameterization trick : \tilde{\mathbf{z}}=\boldsymbol{\mu}(\mathbf{x}, \widetilde{\boldsymbol{\omega}} ; \theta)+\boldsymbol{\sigma}(\mathbf{x}, \widetilde{\boldsymbol{\omega}} ; \theta) \odot \tilde{\boldsymbol{\varepsilon}}, \quad \tilde{\boldsymbol{\varepsilon}} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) ( a training-step sketch follows below )
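A hedged sketch of one training step under this objective: keeping dropout active plays the role of sampling \boldsymbol{\omega}, the reparameterized \mathbf{z} lives inside the attention module sketched above, and the classification loss, optimizer, and \lambda value are assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn as nn

def training_step(encoder, attention, predictor, optimizer, x, y, lam=1e-4):
    """encoder: nn.Module with dropout -> (batch, i, r) features (dropout ON = sample omega)
    attention: the UncertaintyAwareAttention sketch above (reparameterized z inside)
    predictor: nn.Module mapping the context vector to class logits"""
    encoder.train()                                   # keep dropout ON (MC dropout)
    v = encoder(x)                                    # features under one dropout mask (step 1)
    c, a = attention(v)                               # reparameterized z -> a -> context (step 2)
    logits = predictor(c)
    nll = nn.functional.cross_entropy(logits, y)      # -log p(y | z~, x), assuming classification
    l2 = sum((p ** 2).sum() for p in encoder.parameters())
    loss = nll + lam * l2                             # minimize -L = NLL + lambda * ||M||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```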


Testing on a new input \mathbf{x}^{*} : use MC sampling…

  • p\left(\mathbf{y}^{*} \mid \mathbf{x}^{*}\right)=\iint p\left(\mathbf{y}^{*} \mid \mathbf{x}^{*}, \mathbf{z}\right) p\left(\mathbf{z} \mid \mathbf{x}^{*}, \boldsymbol{\omega}\right) p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y}) \mathrm{d} \boldsymbol{\omega} \mathrm{d} \mathbf{z} \approx \frac{1}{S} \sum_{s=1}^{S} p\left(\mathbf{y}^{*} \mid \mathbf{x}^{*}, \tilde{\mathbf{z}}^{(s)}\right)
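A sketch of the corresponding MC prediction, averaging S stochastic forward passes with dropout kept on; S = 30 and the softmax classification output are assumptions.

```python
import torch

@torch.no_grad()
def mc_predict(encoder, attention, predictor, x_star, S=30):
    encoder.train()                      # keep dropout active so each pass samples a fresh omega
    probs = []
    for _ in range(S):
        v = encoder(x_star)              # features under a new dropout mask
        c, _ = attention(v)              # new sample z~^(s) -> attention -> context
        probs.append(torch.softmax(predictor(c), dim=-1))
    return torch.stack(probs).mean(0)    # (1/S) * sum_s p(y* | x*, z~^(s))
```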


Uncertainty Calibration

UA is confirmed to achieve better ( lower ) ECE ( Expected Calibration Error ) than the baselines!

( = the expected gap between confidence and accuracy, taken w.r.t. the distribution of model confidence )

\mathrm{ECE}=\mathbb{E}_{\text{confidence}}\left[\,\left|\,p(\text{correct} \mid \text{confidence})-\text{confidence}\,\right|\,\right]
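A minimal binned estimator of this quantity; the equal-width binning and bin count are assumptions, and the paper may use a different estimator.

```python
import torch

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: (N,) predicted max-class probabilities; correct: (N,) 0/1 indicators."""
    ece = torch.zeros(())
    edges = torch.linspace(0, 1, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].float().mean()     # empirical p(correct | confidence in bin)
            conf = confidences[in_bin].mean()        # average confidence in the bin
            ece += in_bin.float().mean() * (acc - conf).abs()
    return ece.item()
```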

( see Figure 2 of the paper )