Uncertainty-Aware Attention for Reliable Interpretation and Prediction
Contents
- Abstract
- Introduction
- Approach
- Stochastic Attention with input-adaptive Gaussian Noise
- Variational Inference
0. Abstract
The attention mechanism is powerful in that it lets a model focus on the relevant features….
→ but it may be UNRELIABLE
To overcome this…. the paper introduces the notion of input-dependent uncertainty
( = generates attention for each feature with varying degrees of noise based on the given input )
→ learns larger variance on instances with high uncertainty
→ proposes the UA ( Uncertainty-aware Attention ) mechanism, trained with VI ( Variational Inference )
1. Introduction
Background
- Achieving high reliability is crucial, especially in safety-related applications!
- Attention : finds the most relevant features for each input instance
  ( + allows easy interpretation )
- Limitation of attention : it can be unreliable
  → we need a model that knows its own limitations!
  ( = the model itself should know whether it is safe enough to make a prediction / decision )
Proposal
- allow attention to output uncertainty on each input
- furthermore, leverage this uncertainty when making the final prediction
- concretely, model the attention weights as Gaussians with input-dependent noise
Contribution
- 1) proposes a variational attention model
- 2) shows that UA yields accurate calibration of model uncertainty
- 3) validates the approach on six real-world datasets
2. Approach
Stochastic attention ( not first proposed in this paper )
- \mathbf{v}(\mathbf{x}) \in \mathbb{R}^{r \times i} : concatenation of i intermediate features
- each column \mathbf{v}_{j}(\mathbf{x}) is a length-r vector
- from \mathbf{v}(\mathbf{x}), the random variables \left\{a_{j}\right\}_{j=1}^{i} are generated conditionally
- \mathbf{c}(\mathbf{x})=\sum_{j=1}^{i} a_{j} \odot \mathbf{v}_{j}(\mathbf{x}) : context vector ( \mathbf{c} \in \mathbb{R}^{r} )
- \hat{y}=f(\mathbf{c}(\mathbf{x})) : final output
Attention can be deterministic or stochastic.
- ex) stochastic attention : a_{j} is sampled from a Bernoulli distribution
  → trained by maximizing the ELBO
- stochastic attention was shown to outperform its deterministic counterpart on an image annotation task
( a minimal code sketch of the attention computation above is given right below )
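For reference, here is a minimal PyTorch sketch of the plain ( deterministic ) attention computation using the notation above; the (batch, i, r) feature layout, the linear scoring network, and the sigmoid squashing are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Deterministic attention over i intermediate features.

    v(x) is stored as (batch, i, r): one length-r feature v_j(x) per row.
    A small scoring network produces a weight a_j per feature, and the
    context vector is c(x) = sum_j a_j * v_j(x).
    """
    def __init__(self, r: int):
        super().__init__()
        self.score = nn.Linear(r, 1)       # one scalar score per feature column

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.score(v))   # (batch, i, 1) attention weights
        c = (a * v).sum(dim=1)             # context vector, (batch, r)
        return c

# usage: i = 10 intermediate features of dimension r = 16
v = torch.randn(4, 10, 16)
c = SoftAttention(16)(v)                   # (4, 16); y_hat = f(c) is any downstream predictor
```

The stochastic variants discussed next replace the deterministic weights a_{j} with sampled ones.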
2-1. Stochastic Attention with input-adaptive Gaussian Noise
Two limitations of the stochastic attention above
( i.e., of drawing the attention directly from a Bernoulli / Multinomial distribution )
- [ Limitation 1 ] the variance of a Bernoulli is DEPENDENT on the allocation probability \mu
  - a \sim \operatorname{Bernoulli}(\mu)
  - allocation probability \mu : whether or not to attend to a feature
  - variance of the Bernoulli : \sigma^{2}=\mu(1-\mu)
  - therefore, the variance of a cannot be small when \mu is near 0.5 ( it is maximal, 0.25, at \mu=0.5 ), regardless of how certain the model actually is
- [ Solution 1 ]
  disentangle the attention strength a from the attention uncertainty
  → so that the uncertainty can vary even for the same attention strength
- [ Limitation 2 ] vanilla stochastic attention models the noise independently of the input
  - ( previous approach ) the noise is modeled without regard to the input
- [ Solution 2 ] to overcome the two limitations above…
  model the noise as a function of the input ( \sigma(\mathbf{x}) )
- ( previous ) p(\boldsymbol{\omega})=\mathcal{N}\left(\mathbf{0}, \tau^{-1} \mathbf{I}\right), \quad p_{\theta}(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\omega})=\mathcal{N}\left(\boldsymbol{\mu}(\mathbf{x}, \boldsymbol{\omega} ; \theta), \operatorname{diag}\left(\boldsymbol{\sigma}^{2}\right)\right)
- ( new ) p(\boldsymbol{\omega})=\mathcal{N}\left(\mathbf{0}, \tau^{-1} \mathbf{I}\right), \quad p_{\theta}(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\omega})=\mathcal{N}\left(\boldsymbol{\mu}(\mathbf{x}, \boldsymbol{\omega} ; \theta), \operatorname{diag}\left(\boldsymbol{\sigma}^{2}(\mathbf{x}, \boldsymbol{\omega} ; \theta)\right)\right)
( in the two equations above, \mathbf{z} is the attention score before squashing, i.e. \mathbf{a}=\pi(\mathbf{z}) ; a code sketch of this parameterization follows at the end of this subsection )
- empirically shown that the quality of the uncertainty estimates improves
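A minimal sketch of such an input-adaptive Gaussian attention layer, assuming linear heads for \mu and \sigma and a sigmoid as the squashing function \pi ( these choices are mine for illustration, not taken from the paper's implementation ) — both moments depend on the input, and z is sampled with the reparameterization trick:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertainAttention(nn.Module):
    """Attention with input-adaptive Gaussian noise.

    z ~ N( mu(x), diag(sigma^2(x)) )   # both moments are functions of the input
    a = pi(z)                          # squashing, here a sigmoid
    c = sum_j a_j * v_j(x)             # context vector
    """
    def __init__(self, r: int):
        super().__init__()
        self.mu_head = nn.Linear(r, 1)        # input-dependent mean
        self.sigma_head = nn.Linear(r, 1)     # unconstrained; softplus makes it positive

    def forward(self, v: torch.Tensor):
        # v: (batch, i, r) -- the i intermediate feature columns
        mu = self.mu_head(v)                           # (batch, i, 1)
        sigma = F.softplus(self.sigma_head(v))         # input-adaptive std
        eps = torch.randn_like(mu)                     # reparameterization trick
        z = mu + sigma * eps                           # attention score before squashing
        a = torch.sigmoid(z)                           # a = pi(z)
        c = (a * v).sum(dim=1)                         # context vector, (batch, r)
        return c, a, sigma                             # sigma exposes per-feature uncertainty
```

Disentangling \mu and \sigma this way is exactly what Solution 1 asks for: two features can receive the same attention strength while carrying very different uncertainty.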
2-2. Variational Inference
( the model defined in 2-1 above is trained with VI )
\mathbf{Z} : set of latent variables \left\{\mathbf{z}^{(n)}\right\}_{n=1}^{N} that stand for the attention weights before squashing.
The posterior p(\mathbf{Z}, \boldsymbol{\omega} \mid \mathcal{D}) is usually computationally intractable!
\rightarrow use VI ( Variational Inference )
Variational Distribution :
- q(\mathbf{Z}, \boldsymbol{\omega} \mid \mathcal{D})=q_{\mathbf{M}}(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})\, q(\mathbf{Z} \mid \mathbf{X}, \mathbf{Y}, \boldsymbol{\omega})
- 1st term ) modeled with MC Dropout ( variational parameter \mathbf{M} )
- 2nd term ) simply set q(\mathbf{Z} \mid \mathbf{X}, \mathbf{Y}, \boldsymbol{\omega})=p_{\theta}(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\omega})
Putting this together, training is equivalent to maximizing the ELBO below ( with the choice of q above, the last KL term vanishes ).
- \begin{aligned} \log p(\mathbf{Y} \mid \mathbf{X}) \geq & \mathbb{E}_{\boldsymbol{\omega} \sim q_{\mathbf{M}}(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y}), \mathbf{Z} \sim p_{\theta}(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\omega})}[\log p(\mathbf{Y} \mid \mathbf{X}, \mathbf{Z}, \boldsymbol{\omega})] \\ &-\mathrm{KL}\left[q_{\mathbf{M}}(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y}) \| p(\boldsymbol{\omega})\right]-\operatorname{KL}\left[q(\mathbf{Z} \mid \mathbf{X}, \mathbf{Y}, \boldsymbol{\omega}) \| p_{\theta}(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\omega})\right] \end{aligned}.
The final objective to maximize : \mathcal{L}(\theta, \mathbf{M} ; \mathbf{X}, \mathbf{Y})=\sum_{n=1}^{N} \log p_{\theta}\left(\mathbf{y}^{(n)} \mid \tilde{\mathbf{z}}^{(n)}, \mathbf{x}^{(n)}\right)-\lambda\|\mathbf{M}\|^{2}
- step 1) sample random weights with dropout masks \widetilde{\omega} \sim q_{\mathrm{M}}(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})
- step 2) sample \mathbf{z} with the reparameterization trick : \tilde{\mathbf{z}}=\boldsymbol{\mu}(\mathbf{x}, \widetilde{\boldsymbol{\omega}} ; \theta)+\boldsymbol{\sigma}(\mathbf{x}, \widetilde{\boldsymbol{\omega}} ; \theta) \odot \tilde{\boldsymbol{\varepsilon}}, \quad \tilde{\boldsymbol{\varepsilon}} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
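A toy training step under this objective ( PyTorch ; the model below is a hypothetical composition of dropout, the UncertainAttention sketch from 2-1, and a linear head, and the binary cross-entropy / weight-decay choices are my assumptions for a binary task, not the paper's exact setup ) :

```python
import torch
import torch.nn as nn

class ToyUAModel(nn.Module):
    """Dropout realizes omega ~ q_M; UncertainAttention (sketched in 2-1) samples z."""
    def __init__(self, r: int):
        super().__init__()
        self.dropout = nn.Dropout(p=0.2)
        self.attn = UncertainAttention(r)      # from the sketch in section 2-1
        self.head = nn.Linear(r, 1)

    def forward(self, v):
        v = self.dropout(v)                    # step 1) sample omega via a dropout mask
        c, a, sigma = self.attn(v)             # step 2) sample z by reparameterization
        return torch.sigmoid(self.head(c)).squeeze(-1)

model = ToyUAModel(r=16)
# the weight decay plays the role of the lambda * ||M||^2 penalty in the objective
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

def train_step(v, y):
    model.train()                              # keeps dropout active during training
    optimizer.zero_grad()
    y_hat = model(v)
    nll = nn.functional.binary_cross_entropy(y_hat, y)   # - log p(y | z~, x)
    nll.backward()
    optimizer.step()
    return nll.item()

# usage: 8 instances, i = 10 features of dimension r = 16, binary labels
v, y = torch.randn(8, 10, 16), torch.randint(0, 2, (8,)).float()
loss = train_step(v, y)
```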
Prediction for a new input \mathbf{x}^{*} : use MC sampling
- p\left(\mathbf{y}^{*} \mid \mathbf{x}^{*}\right)=\iint p\left(\mathbf{y}^{*} \mid \mathbf{x}^{*}, \mathbf{z}\right) p\left(\mathbf{z} \mid \mathbf{x}^{*}, \boldsymbol{\omega}\right) p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y}) \mathrm{d} \boldsymbol{\omega} \mathrm{d} \mathbf{z} \approx \frac{1}{S} \sum_{s=1}^{S} p\left(\mathbf{y}^{*} \mid \mathbf{x}^{*}, \tilde{\mathbf{z}}^{(s)}\right)
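In the same toy setup as above ( so the model and variable names are assumptions ), the MC approximation amounts to keeping dropout on at test time and averaging S stochastic forward passes :

```python
import torch

def predict_mc(model, v, S: int = 30):
    """Approximate p(y* | x*) with S stochastic forward passes.

    Each pass re-samples omega (via dropout) and z (inside the attention
    layer); the mean is the prediction and the spread reflects uncertainty.
    """
    model.train()                          # keep dropout active for MC sampling
    with torch.no_grad():
        probs = torch.stack([model(v) for _ in range(S)])   # (S, batch)
    return probs.mean(dim=0), probs.std(dim=0)
```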
Uncertainty Calibration
Confirmed that UA achieves a better ( i.e., lower ) ECE ( Expected Calibration Error ) than the baselines!
( = the expected gap between accuracy and confidence, taken over the distribution of model confidence )
\mathrm{ECE}=\mathbb{E}_{\text{confidence}}\left[\,\left|\, p(\text{correct} \mid \text{confidence})-\text{confidence} \,\right|\,\right]
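For reference, a common binned estimate of this quantity ( NumPy ; the 10-bin histogram estimator is a standard choice, not something specified in this post ) :

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Binned estimate of ECE = E_confidence[ |p(correct | confidence) - confidence| ].

    confidences : predicted confidences in [0, 1]
    correct     : whether each prediction was correct (0/1)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()        # empirical p(correct | confidence)
            conf = confidences[in_bin].mean()   # average confidence in the bin
            ece += in_bin.mean() * abs(acc - conf)
    return float(ece)
```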