4. Models for Continuous data
(1) Exponential data
We covered models for discrete data in the last post. Now we’re going to talk about the ‘continuous’ case. The first one is exponential data.
The gamma distribution is conjugate for an exponential likelihood. Actually, gammas are conjugate priors for a number of different likelihoods.
Let’s prove why the gamma distribution is conjugate for an exponential likelihood.
Y follows an exponential distribution with parameter lambda.
\(Y \sim Exp(\lambda)\)
Let the prior distribution follow a gamma distribution
\(\lambda \sim Gamma(\alpha,\beta)\)
And the posterior distribution would look like this.
\(\begin{align*} f(\lambda\mid y)&\propto f(y\mid \lambda)f(\lambda)\\ &\propto \lambda e^{-\lambda y}\lambda^{\alpha-1}e^{-\beta\lambda}\\ &\propto \lambda^{(\alpha+1)-1}e^{-(\beta+y)\lambda} \end{align*}\)
\(\lambda\mid y \sim Gamma(\alpha+1, \beta + y)\)
We can see that the posterior distribution also follows a gamma distribution.
Example
Let’s put this into an example about the time spent waiting for a bus. Say the prior follows a Gamma(100,1000) distribution, which means the prior mean is 100/1000 = 1/10 (we usually wait about 10 minutes for a bus to arrive, so the arrival rate is about 0.1 per minute). And say we’re looking at plus or minus 0.02 as a plausible range for this parameter (0.1-0.02 ~ 0.1+0.02). Suppose we have just waited 12 minutes for a bus to arrive. If we plug this observation into the equation above and compute the posterior, the posterior distribution will be Gamma(100+1, 1000+12). As a result, the posterior mean will be 101/1012, which is almost 1/10.02. We can say that this single observation does not have a big impact here.
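Here is a minimal Python sketch of this update, using the numbers from the example (the variable names are just for illustration):

```python
# Conjugate update for exponential data with a gamma prior:
# Gamma(alpha, beta)  ->  Gamma(alpha + 1, beta + y)  after one observation y.

alpha_prior, beta_prior = 100.0, 1000.0   # prior Gamma(100, 1000), mean 100/1000 = 1/10
y = 12.0                                  # we just waited 12 minutes for the bus

alpha_post, beta_post = alpha_prior + 1, beta_prior + y

print("prior mean:    ", alpha_prior / beta_prior)   # 0.1
print("posterior mean:", alpha_post / beta_post)     # 101/1012, roughly 0.0998
```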
(2) Normal data
The normal distribution is conjugate for itself! I will not give a proof here, since it can be derived in the same way as above.
a. Normal likelihood with variance KNOWN
Let’s talk about the case when the variance of the normal likelihood is ‘known’.
Suppose the \(X_i\)’s follow a normal distribution as below
\(X_i \sim N(\mu, \sigma_{0}^2)\)
And let the prior distribution for \(\mu\) follow a normal distribution, with mean \(m_0\) and variance \(s_0^2\).
\(\mu \sim N(m_0, s_0^2)\)
Then the posterior distribution would look like this.
\(f(\mu\mid x) \propto f(x\mid \mu)f(\mu)\)
\(\mu\mid x \sim N(\frac{\frac{n\bar{x}}{\sigma_0^2}+\frac{m_0}{s_0^2}}{\frac{n}{\sigma_0^2}+\frac{1}{s_0^2}},\frac{1}{\frac{n}{\sigma_0^2}+\frac{1}{s_0^2}})\)
As a result, the posterior mean can be written as below. It is a weighted average of the ‘prior mean’ and the ‘data mean’, and the effective sample size of the prior is the ratio of the variance of the data to the variance of the prior, \(\sigma_0^2/s_0^2\).
\(\frac{n}{n+\frac{\sigma_0^2}{s_0^2}}\bar{x} + \frac{\frac{\sigma_0^2}{s_0^2}}{n+\frac{\sigma_0^2}{s_0^2}}m_0\)
Notice that the larger the prior variance, the less weight the prior mean gets (it carries less information), and the smaller the prior variance, the more weight it gets (it carries more information). It makes sense!
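To make the weighted-average interpretation concrete, here is a small Python sketch of the known-variance update; all the numbers (prior mean, variances, and the data) are made up for illustration.

```python
import numpy as np

sigma0_sq = 4.0          # known variance of the data
m0, s0_sq = 0.0, 1.0     # prior mean and prior variance for mu

x = np.array([1.2, 0.8, 1.5, 1.1])   # hypothetical observations
n, x_bar = len(x), x.mean()

# Posterior precision = data precision + prior precision
post_var = 1.0 / (n / sigma0_sq + 1.0 / s0_sq)
post_mean = post_var * (n * x_bar / sigma0_sq + m0 / s0_sq)

# Same posterior mean written as a weighted average, with
# effective prior sample size sigma0^2 / s0^2
w_prior = sigma0_sq / s0_sq
post_mean_alt = (n / (n + w_prior)) * x_bar + (w_prior / (n + w_prior)) * m0

print(post_mean, post_mean_alt, post_var)   # the two means agree
```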
b. Normal likelihood with variance UNKNOWN
But what if both the mean and the variance are unknown? Then we can specify a conjugate prior in a hierarchical fashion.
What does ‘hierarchical fashion’ mean?
We first let X follow a normal distribution given \(\mu\) and \(\sigma^2\), and then let \(\mu\) follow a normal distribution given \(\sigma^2\). Look at the expressions below, and you can see what it means.
\(X_i\mid \mu,\sigma^2 \sim N(\mu, \sigma^2)\)
\(\mu\mid \sigma^2 \sim N(m, \frac{\sigma^2}{w})\)
In the expression above, w is the effective sample size of the prior, which can be expressed as \(w = \frac {\sigma^2}{\sigma_\mu^2}\)
And the conjugate prior for \(\sigma^2\) is an inverse gamma distribution.
\(\sigma^2 \sim InvGamma(\alpha, \beta)\)
Then the posterior distribution of \(\sigma^2\), given x, will look like this.
\(\sigma^2 \mid x\sim InvGamma(\alpha+\frac{n}{2},\beta+\frac{1}{2}\sum_{i=1}^{n}(x_i-\bar{x})^2+\frac{nw}{2(n+w)}(\bar{x}-m)^2)\)
\(\mu\mid \sigma^2,x \sim N(\frac{n\bar{x}+wm}{n+w},\frac{\sigma^2}{n+w})\)
The posterior mean here can be decomposed as below (a weighted average of the prior mean and the data mean):
\(\frac{n\bar{x}+wm}{n+w} = \frac{w}{n+w}m + \frac{n}{n+w}\bar{x}\)
If we only care about the mean \(\mu\), we can marginalize out \(\sigma^2\) so that the result no longer depends on it. Then we can see that \(\mu\) follows a t distribution.
\(\mu \mid x \sim t\)
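Here is a minimal Python sketch of the hierarchical (normal-inverse-gamma) update above; the hyperparameters m, w, alpha, beta and the data are illustrative values, not numbers from the text.

```python
import numpy as np

m, w = 0.0, 1.0            # prior mean for mu, effective prior sample size
alpha, beta = 2.0, 2.0     # inverse-gamma hyperparameters for sigma^2

x = np.array([2.1, 1.7, 2.5, 2.0, 1.9])   # hypothetical observations
n, x_bar = len(x), x.mean()

# sigma^2 | x  ~  InvGamma(alpha_post, beta_post)
alpha_post = alpha + n / 2
beta_post = (beta
             + 0.5 * np.sum((x - x_bar) ** 2)
             + (n * w) / (2 * (n + w)) * (x_bar - m) ** 2)

# mu | sigma^2, x  ~  N(mu_post, sigma^2 / (n + w))
mu_post = (n * x_bar + w * m) / (n + w)

print(alpha_post, beta_post, mu_post)
```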
(3) Alternative priors
a. Non-informative priors
As the name says, the prior is non-informative. The idea is to let the data have the maximum influence on the posterior.
Let’s take the example of flipping a coin. Say theta is the probability of the coin coming up heads. How can we minimize our prior information? One good candidate is a theta following a uniform distribution on the interval [0,1]. It seems to give no information at all, since all values of theta are equally likely. But even though this seems non-informative, it is not a completely non-informative prior. As you know, the uniform distribution is a beta distribution with alpha 1, beta 1 (so its effective sample size is 2 (=1+1)). To make it even less informative, we can reduce these to Beta(0.0001,0.0001), or go even further to the limiting case Beta(0,0). As you might know, Beta(0,0) is not a proper density: it has an infinite integral (it does not integrate to 1). This is what we call an “improper prior”. Let’s see it with a mathematical expression.
\(Y_i \sim Bernoulli(\theta)\)
\(\theta \sim Beta(0,0)\)
\(f(\theta) \propto \theta^{-1}(1-\theta)^{-1}\)
\(f(\theta\mid y) \propto \theta^{y-1}(1-\theta)^{n-y-1}\), i.e. \(\theta\mid y \sim Beta(y,n-y)\), where \(y=\sum_{i=1}^{n}y_i\) is the number of heads in n flips.
As you can see above, we get a posterior which gives us point estimates that are the same as the frequentist approach (the posterior mean is \(y/n\))! There was no problem even though we used an improper prior.
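A tiny Python sketch makes this concrete; the flip counts below are made up for illustration.

```python
# With the improper Beta(0, 0) prior, y heads in n flips give a
# Beta(y, n - y) posterior, whose mean equals the frequentist estimate y / n.

n, y = 40, 13                         # hypothetical: 13 heads in 40 flips

alpha_post, beta_post = y, n - y      # posterior Beta(y, n - y)
posterior_mean = alpha_post / (alpha_post + beta_post)

print(posterior_mean, y / n)          # both are 0.325
```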
The key points that I want to make are:
- 1 ) It is okay to use improper priors, as long as the resulting posterior is proper.
- 2 ) For many problems, there exists an improper prior like this one that reproduces the frequentist point estimates.
b. Jeffreys prior
Even if we choose a prior of the same form, say a uniform prior, what it actually says depends upon the particular parameterization.
Think about the normal distribution with a uniform prior.
\(Y_i \sim N(\mu,\sigma^2)\)
Some might use a prior for \(\sigma^2\) like the one below (which is uniform on the log scale of \(\sigma^2\)).
\(f(\sigma^2) \propto \frac{1}{\sigma^2}\)
And some might use a prior like this (which is uniform on the scale of \(\sigma^2\) itself).
\(f(\sigma^2) \propto 1\)
Although these are both ‘uniform’ priors (on different scales, that is, under different parameterizations), they are different priors, so the posteriors computed from them will also differ. In other words, uniform priors are not invariant under reparameterization. That is why Jeffreys came up with the Jeffreys prior, a non-informative prior that is invariant to the parameterization used.
The Jeffreys prior is the prior that satisfies \(p(\theta) \propto J(\theta)^{1/2}\), where \(J(\theta)\) is the Fisher information for theta.
In the case of the binomial model, the Jeffreys prior is
\(\theta \sim Beta(0.5,0.5)\)
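As a quick check of where this comes from: for a single Bernoulli observation with density \(f(y\mid\theta)=\theta^{y}(1-\theta)^{1-y}\), the Fisher information can be worked out directly, and taking its square root gives exactly the Beta(0.5, 0.5) kernel.
\(\begin{align*} J(\theta) &= E\left[-\frac{\partial^2}{\partial\theta^2}\log f(y\mid\theta)\right] = E\left[\frac{y}{\theta^2}+\frac{1-y}{(1-\theta)^2}\right] = \frac{1}{\theta}+\frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}\\ p(\theta) &\propto J(\theta)^{1/2} = \theta^{-1/2}(1-\theta)^{-1/2} \end{align*}\)
This is the kernel of a \(Beta(0.5,0.5)\) density, which is why the Jeffreys prior for this model is the beta distribution above.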