Variational Inference for Nonparametric Bayesian Quantile Regression (2015)
Abstract
Presents a non-parametric method of inferring quantiles and derives a novel variational Bayes (VB) approximation
1. Introduction
Quantile regression
- introduced as a method of modelling variation in functions
2 main approaches used in inferring quantiles :
1) via the CDF
2) via a loss function that penalizes predictive quantiles at the wrong locations
- observations \(\mathbf{y}_{i}\) & inferred quantile \(\mathbf{f}_{i}\), with \(\xi_{i}=\mathbf{y}_{i}-\mathbf{f}_{i}\)
- the pinball loss :
\(\mathcal{L}\left(\xi_{i}, \alpha\right)=\left\{\begin{array}{ll} \alpha \xi_{i} & \text { if } \xi_{i} \geq 0 \\ (\alpha-1) \xi_{i} & \text { if } \xi_{i}<0 \end{array}\right.\)
- with regularization :
\(\mathcal{L}(\alpha, \mathbf{y}, \mathbf{f})+\lambda\|\mathbf{f}\|\), where \(\mathcal{L}(\alpha, \mathbf{y}, \mathbf{f})=\sum_{i=1}^{N} \mathcal{L}\left(\mathbf{y}_{i}-\mathbf{f}_{i}, \alpha\right)\)
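A minimal NumPy sketch of the pinball loss and the regularized objective ( the names `pinball_loss` / `objective` and the L2 choice for \(\|\mathbf{f}\|\) are my own assumptions, not from the paper ) :

```python
import numpy as np

def pinball_loss(y, f, alpha):
    """Pinball loss L(xi, alpha) with xi = y - f, applied elementwise.

    Under-predictions are weighted by alpha and over-predictions by
    (1 - alpha), so the minimizer is the alpha-quantile of y.
    """
    xi = y - f
    return np.where(xi >= 0, alpha * xi, (alpha - 1) * xi)

def objective(y, f, alpha, lam):
    """Regularized objective L(alpha, y, f) + lambda * ||f|| (L2 norm assumed)."""
    return pinball_loss(y, f, alpha).sum() + lam * np.linalg.norm(f)
```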
this paper adopts the second approach
\(\rightarrow\) the loss function is minimized, but within a Bayesian framework
derives a non-parametric approach to modelling the quantile function
2. Bayesian Quantile Regression
goal : derive the posterior \(p\left(\mathbf{f}_{\star} \mid \mathbf{y}, \mathbf{x}_{\star}, \mathbf{x}\right)\)
- \(\mathbf{f}_{\star}\) : prediction for some input \(\mathbf{x}_{\star}\)
- done by marginalizing out all latent variables
- priors :
- on function : Gaussian Process prior
- on \(\sigma\) : Inverse Gamma prior
Data Likelihood
( = exponentiation of the negative pinball loss \(\mathcal{L}\left(\xi_{i}, \alpha\right)\) above, i.e. an asymmetric Laplace distribution )
\(p\left(\mathbf{y}_{i} \mid \mathbf{f}_{i}, \alpha, \sigma, \mathbf{x}_{i}\right) =\frac{\alpha(1-\alpha)}{\sigma} \exp \left(-\frac{\xi_{i}\left(\alpha-I\left(\xi_{i}<0\right)\right)}{\sigma}\right)\).
where \(\xi_{i}=\mathbf{y}_{i}-\mathbf{f}_{i}\)
- \(p(\mathbf{f} \mid \mathbf{x}) =\mathcal{N}(\mathbf{m}(\mathbf{x}), \mathbf{K}(\mathbf{x}))\).
- \(p(\sigma) =\operatorname{IG}\left(10^{-6}, 10^{-6}\right)\).
Important property of likelihood :
- \(p\left(\mathbf{y}_{i}<\mathbf{f}_{i}\right)=\alpha\)
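A sketch evaluating this likelihood and checking the quantile property numerically ( `ald_density` is my own name; the density is the asymmetric Laplace ) :

```python
import numpy as np
from scipy.integrate import quad

def ald_density(y, f, alpha, sigma):
    """Asymmetric Laplace likelihood p(y | f, alpha, sigma):
    the normalized exponential of the negative pinball loss."""
    xi = y - f
    return alpha * (1 - alpha) / sigma * np.exp(-xi * (alpha - (xi < 0)) / sigma)

# Check p(y_i < f_i) = alpha by integrating the density up to f_i = 0
alpha, sigma = 0.3, 1.0
mass_below, _ = quad(lambda y: ald_density(y, 0.0, alpha, sigma), -np.inf, 0.0)
print(mass_below)  # ~= 0.3 = alpha
```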
can be written as a continuous mixture of Gaussians, mixing over \(\mathbf{w}_{i} \sim \operatorname{Exp}(1)\) :
\(p\left(\mathbf{y}_{i} \mid \mathbf{f}_{i}, \mathbf{x}_{i}, \sigma, \alpha\right)=\int \mathcal{N}\left(\mathbf{y}_{i} \mid \mu_{\mathbf{y}_{i}}, \sigma_{\mathbf{y}_{i}}\right) \exp \left(-\mathbf{w}_{i}\right) d \mathbf{w}_{i}\)
- \(\mu_{\mathbf{y}_{i}}=\mathbf{f}_{i}\left(\mathbf{x}_{i}\right)+\frac{1-2 \alpha}{\alpha(1-\alpha)} \sigma \mathbf{w}_{i}\).
- \(\sigma_{\mathbf{y}_{i}}=\frac{2}{\alpha(1-\alpha)} \sigma^{2} \mathbf{w}_{i}\).
\(\rightarrow\) the likelihood can be represented as a joint distribution with \(\mathbf{w}\) ( which will be marginalized out )
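A Monte Carlo sanity check ( my own sketch ) that drawing \(\mathbf{w}_{i} \sim \operatorname{Exp}(1)\) and then a Gaussian with the mean and variance above reproduces the asymmetric Laplace likelihood; note \(\sigma_{\mathbf{y}_{i}}\) is treated as a variance here :

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma, f_i, n = 0.3, 1.0, 0.0, 500_000

w = rng.exponential(1.0, size=n)   # w_i ~ Exp(1), matching the exp(-w_i) mixing density
mu_y = f_i + (1 - 2 * alpha) / (alpha * (1 - alpha)) * sigma * w
var_y = 2.0 / (alpha * (1 - alpha)) * sigma**2 * w
y = rng.normal(mu_y, np.sqrt(var_y))   # y_i | w_i ~ N(mu_y, var_y)

# The marginal of y should be the asymmetric Laplace above,
# so the quantile property p(y_i < f_i) = alpha must hold:
print((y < f_i).mean())  # ~= 0.3 = alpha
```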
3. Variational Bayesian Inference
Log marginal likelihood :
- \(\log p(\mathbf{y} \mid \mathbf{x}, \alpha, \theta)\)
can be decomposed as
\(\mathcal{L}(q(\mathbf{f}, \mathbf{w}, \sigma), \theta \mid \alpha)+\mathrm{KL}(q(\mathbf{f}, \mathbf{w}, \sigma) \,\|\, p(\mathbf{f}, \mathbf{w}, \sigma \mid \mathbf{y}, \theta, \alpha))\)
where \(\mathcal{L}=\iiint q(\mathbf{f}, \mathbf{w}, \sigma) \log \frac{p(\mathbf{f}, \mathbf{w}, \sigma, \mathbf{y} \mid \theta, \alpha)}{q(\mathbf{f}, \mathbf{w}, \sigma)} d \mathbf{f}\, d \mathbf{w}\, d \sigma\) ( the ELBO )
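The decomposition is the standard variational identity; writing \(\mathbf{z}=(\mathbf{f}, \mathbf{w}, \sigma)\) :
\[
\log p(\mathbf{y} \mid \mathbf{x}, \alpha, \theta)
= \underbrace{\int q(\mathbf{z}) \log \frac{p(\mathbf{z}, \mathbf{y} \mid \theta, \alpha)}{q(\mathbf{z})} \, d\mathbf{z}}_{\mathcal{L}}
+ \underbrace{\int q(\mathbf{z}) \log \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{y}, \theta, \alpha)} \, d\mathbf{z}}_{\mathrm{KL} \, \geq \, 0}
\]
since the left side is constant in \(q\), maximizing \(\mathcal{L}\) is equivalent to minimizing the KL to the true posterior.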
Closed form solution ( the standard mean-field update ) :
- \(q\left(z_{i}\right)=\exp \left(\left\langle\log p(\mathbf{z}, \mathbf{y})\right\rangle_{q\left(\mathbf{z}_{\setminus i}\right)}\right) / Z\)
Approximate posterior on the function space : \(\mathcal{N}(\mu, \Sigma)\)
- \(\Sigma=\left(\left\langle\mathbf{D}^{-1}\right\rangle+\mathbf{K}^{-1}\right)^{-1}\)
- \(\boldsymbol{\mu}=\Sigma\left(\left\langle\mathbf{D}^{-1}\right\rangle \mathbf{y}-\frac{1-2 \alpha}{2}\left\langle\frac{1}{\sigma}\right\rangle \mathbf{1}\right)\)
- \(\mathbf{D}=\frac{2}{\alpha(1-\alpha)} \sigma^{2} \operatorname{diag}(\mathbf{w})\)
- \(\langle\mathbf{f}\rangle=\boldsymbol{\mu}\).
- \(\left\langle\mathbf{f f}^{T}\right\rangle=\Sigma+\boldsymbol{\mu} \boldsymbol{\mu}^{T}\)
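A NumPy sketch of this \(q(\mathbf{f})\) update, assuming a zero-mean GP prior ( consistent with the \(\boldsymbol{\mu}\) formula above ) and taking \(\left\langle\mathbf{D}^{-1}\right\rangle\) and \(\langle 1 / \sigma\rangle\) as given from the other factor updates ( all names are mine ) :

```python
import numpy as np

def update_q_f(K, y, alpha, E_Dinv_diag, E_inv_sigma):
    """Mean-field update for q(f) = N(mu, Sigma).

    K           : (N, N) GP prior covariance K(x) (zero prior mean assumed)
    E_Dinv_diag : (N,) diagonal of <D^-1>, D = 2 sigma^2 diag(w) / (alpha(1-alpha))
    E_inv_sigma : scalar expectation <1/sigma> under q(sigma)
    """
    N = len(y)
    E_Dinv = np.diag(E_Dinv_diag)
    # Sigma = (<D^-1> + K^-1)^-1
    Sigma = np.linalg.inv(E_Dinv + np.linalg.inv(K))
    # mu = Sigma (<D^-1> y - (1 - 2 alpha)/2 <1/sigma> 1)
    mu = Sigma @ (E_Dinv @ y - 0.5 * (1 - 2 * alpha) * E_inv_sigma * np.ones(N))
    return mu, Sigma

# Moments passed back to the other factor updates:
#   <f> = mu,  <f f^T> = Sigma + np.outer(mu, mu)
```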
4. Hyper-parameter Optimization
( to be added later )