2. Statistical Inference
(1) Frequentist Inference
a) Confidence Interval
Frequentists view data as a random sample from a large population, and make probability statements based on that population.
What does a frequentist confidence interval actually imply?
Let’s take an example.
Someone flipped a coin 100 times and got 44 heads and 56 tails. Each flip
follows a Bernoulli distribution with probability p,
where p is the probability of the coin landing heads.
So, what is the estimate of p (=p_hat) and the confidence interval (=CI)?
From the frequentist view, the estimate of p is ‘0.44’ (=44/100).
And the 95% confidence interval can be calculated as
\( \left( 0.44 - 1.96\sqrt{\tfrac{0.44 \times 0.56}{100}},\;\; 0.44 + 1.96\sqrt{\tfrac{0.44 \times 0.56}{100}} \right) \)
which works out to ( 0.343, 0.537 ).
Since 0.5 (half, which represents a ‘fair’ coin) lies inside this interval, frequentists would say
that this coin is consistent with being fair.
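As a quick sanity check, here is a minimal Python sketch (standard library only) that reproduces this interval with the normal-approximation formula above; the data values are the ones from the example.

```python
import math

n, heads = 100, 44          # flips and observed heads from the example
p_hat = heads / n           # frequentist point estimate of p
z = 1.96                    # critical value for a 95% normal-approximation CI

# standard error of p_hat under the normal approximation
se = math.sqrt(p_hat * (1 - p_hat) / n)

ci = (p_hat - z * se, p_hat + z * se)
print(p_hat, ci)            # 0.44 (0.3427..., 0.5372...)
```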
So what is the meaning of ‘95% confident’?
In the frequentist view, it means that if we repeated this experiment many times, on average
95% of the intervals constructed this way would contain the true value of p.
But this has a limitation. It tells us nothing about
the particular interval we computed this time. Frequentists cannot answer the question, “What’s the probability of
this interval containing the true p?” In their view, the true p either is or is not in that interval, so the only possible answers are ‘0’ or ‘1’.
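To see what “95% of the intervals” means in practice, here is a small simulation sketch. The true p = 0.5 and the 10,000 repetitions are illustrative choices of mine, not from the text: it repeats the 100-flip experiment many times and checks how often the interval covers the true p.

```python
import random, math

def ci_95(heads, n):
    """Normal-approximation 95% CI for a binomial proportion."""
    p_hat = heads / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - 1.96 * se, p_hat + 1.96 * se

true_p, n, reps = 0.5, 100, 10_000   # assumed values for illustration
covered = 0
for _ in range(reps):
    heads = sum(random.random() < true_p for _ in range(n))
    lo, hi = ci_95(heads, n)
    covered += lo <= true_p <= hi

# roughly 0.95: a statement about the long-run procedure, not about any one interval
print(covered / reps)
```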
b) Likelihood function and Maximum Likelihood
I will not cover the MLE in much depth here, assuming you have already learned it.
In short, the maximum likelihood estimator (MLE) estimates the parameters of a probability distribution
by maximizing the likelihood function, so that the observed data are most probable under the estimated distribution.
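As a small illustration (not part of the original text), the sketch below writes down the Bernoulli log-likelihood for the 44-heads-in-100-flips data and checks numerically that it is maximized at p = 0.44, the same value as the frequentist point estimate.

```python
import math

heads, n = 44, 100

def log_likelihood(p):
    """Bernoulli log-likelihood of observing `heads` in `n` flips (up to an additive constant)."""
    return heads * math.log(p) + (n - heads) * math.log(1 - p)

# crude grid search over candidate values of p
grid = [i / 1000 for i in range(1, 1000)]
mle = max(grid, key=log_likelihood)
print(mle)  # 0.44 — the sample proportion maximizes the likelihood
```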
(2) Bayesian Inference
a) inference example
So, what is so good about Bayesian inference, compared with frequentist inference? It allows you to easily incorporate prior information, that is, what you already know before you look at the data. We’ll see how it works through an example.
Assume there is a 60% probability that the coin is loaded; this probability that the coin is loaded is our ‘prior’ (and, as the likelihood below implies, a loaded coin comes up heads 70% of the time). We can make use of this information when inferring the posterior probability. Let’s say that we toss this coin 5 times and compute
the posterior probability. In this case, there are only two possible values of theta: ‘fair’ or ‘loaded’.
\(P(\theta = loaded) = 0.6\)
\(f(\theta | x) = \frac{f(x|\theta)f(\theta)}{\sum_{\theta}f(x|\theta)f(\theta)} = \frac{\binom{5}{x}\left[(0.5)^x(0.5)^{5-x}(0.4)I_{\theta=fair} + (0.7)^x(0.3)^{5-x}(0.6)I_{\theta=loaded}\right]}{\binom{5}{x}\left[(0.5)^x(0.5)^{5-x}(0.4) + (0.7)^x(0.3)^{5-x}(0.6)\right]}\)
\(f(\theta|X=2) = 0.612I_{\theta=fair} + 0.388I_{\theta=loaded}\)
From this we can say that when X is 2, the probability that the coin is loaded (theta=loaded) is 0.388. This value changes with our prior belief: if our prior probability that the coin is loaded is 0.5, the posterior is 0.297, and if our prior is 0.9, it is 0.792.
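The same posterior can be computed directly. This is a minimal sketch that reproduces the numbers above (0.388, 0.297, 0.792) for X = 2 under the three priors, using the fair-coin and loaded-coin likelihoods from the formula.

```python
from math import comb

def posterior_loaded(x, n=5, prior_loaded=0.6):
    """Posterior probability that the coin is loaded, given x heads in n flips.
    A fair coin has P(heads) = 0.5; the loaded coin has P(heads) = 0.7."""
    like_fair   = comb(n, x) * 0.5**x * 0.5**(n - x)
    like_loaded = comb(n, x) * 0.7**x * 0.3**(n - x)
    num = like_loaded * prior_loaded
    den = like_fair * (1 - prior_loaded) + num
    return num / den

for prior in (0.6, 0.5, 0.9):
    print(prior, round(posterior_loaded(2, prior_loaded=prior), 3))
# 0.6 -> 0.388, 0.5 -> 0.297, 0.9 -> 0.792
```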
b) Continuous version of Bayes’ theorem
We can also apply Bayes’ theorem to continuous parameters. The formula makes this easy to understand.
\(f(\theta|y) = \frac{f(y|\theta)f(\theta)}{f(y)} = \frac{f(y|\theta)f(\theta)}{\int f(y|\theta)f(\theta)\,d\theta} = \frac{\text{likelihood}\times \text{prior}}{\text{normalizing constant}}\)
We often don’t have to compute the integral in the denominator. The posterior is a pdf of theta, but theta does not appear in f(y), so f(y) is just a constant that does not change the shape of the posterior! That’s why we can work with the numerator alone, which is much easier to compute: the posterior is proportional to likelihood × prior.
NOTE) The only difference between the discrete and continuous cases of Bayes’ theorem is that the summation over all values of theta in the denominator is replaced with an integral in the continuous case!
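As a sketch of the continuous case (the Uniform(0, 1) prior and the grid approximation are my own illustrative choices, not from the text), we can put a continuous prior on theta for the 44-heads-in-100-flips data and approximate the normalizing integral numerically; the shape of the posterior comes entirely from likelihood × prior.

```python
heads, n = 44, 100
thetas = [i / 1000 for i in range(1, 1000)]       # grid over (0, 1)

prior = [1.0 for _ in thetas]                     # assumed Uniform(0, 1) prior density
likelihood = [t**heads * (1 - t)**(n - heads) for t in thetas]

unnormalized = [l * p for l, p in zip(likelihood, prior)]   # numerator: likelihood × prior
norm_const = sum(unnormalized) * 0.001                      # grid approximation of the integral f(y)
posterior = [u / norm_const for u in unnormalized]          # a proper pdf over the grid

# under a flat prior, the posterior mode matches the MLE
mode = thetas[max(range(len(thetas)), key=lambda i: posterior[i])]
print(mode)  # 0.44
```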