5. LDA (Latent Dirichlet Allocation) Model
(1) LDA Model
Goal : find a probabilistic model of a corpus that assigns high probability to members of the corpus ( & to other “similar” documents )
This is what the model looks like.
[ Interpretation ]
- for each document d = 1, ..., D ( ex. d=3 : document 3 )
- draw theta_d ~ Dirichlet(alpha) : generate topic probabilities ( ex. (0.5,0.2,0.3) )
- for each word n = 1, ..., N_d ( ex. n=4 : word 4 )
- draw z_dn ~ Categorical(theta_d) : select a topic ( with the probability vector theta )
- draw w_dn ~ Categorical(Phi_{z_dn}) : select a word from that topic
Very Intuitive!
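Here is a minimal sketch of this generative process in NumPy. The function name `generate_corpus` and all of the sizes ( D documents, N words each, K topics, vocabulary V ) are illustrative choices, not from the notes:

```python
import numpy as np

def generate_corpus(D, N, K, V, alpha, seed=0):
    """Sample a toy corpus from the LDA generative process (illustrative)."""
    rng = np.random.default_rng(seed)
    # Phi: K x V matrix; row t is the word distribution of topic t
    Phi = rng.dirichlet(np.ones(V), size=K)
    corpus = []
    for d in range(D):
        theta = rng.dirichlet(alpha)        # topic probabilities of document d, ex. (0.5, 0.2, 0.3)
        z = rng.choice(K, size=N, p=theta)  # topic of each word
        w = [rng.choice(V, p=Phi[t]) for t in z]  # each word drawn from its topic's row of Phi
        corpus.append(w)
    return corpus, Phi

docs, Phi = generate_corpus(D=5, N=20, K=3, V=50, alpha=np.ones(3))
```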
Let’s look at the distributions more carefully.
The prior on theta is a Dirichlet distribution with parameter alpha.
Given theta, the probability of picking topic t is simply the t-th component of theta : p(z_dn = t | theta_d) = theta_dt.
To select the words, we need the probability of each word under each topic. ( we can find it in the matrix Phi! row : the topic z_dn & column : the word w_dn, i.e. p(w_dn | z_dn, Phi) = Phi[z_dn, w_dn] )
We have to find the matrix Phi in the expression above. There are two constraints : every entry of Phi must be non-negative, and every row must sum to 1 ( each row is a distribution over the vocabulary ).
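Putting the pieces together in formulas ( a standard way to write them, with the same symbols as above ):

$$
p(\theta_d \mid \alpha) = \mathrm{Dirichlet}(\theta_d \mid \alpha)
= \frac{\Gamma\!\left(\sum_{t} \alpha_t\right)}{\prod_{t} \Gamma(\alpha_t)} \prod_{t=1}^{K} \theta_{dt}^{\,\alpha_t - 1}
$$

$$
p(z_{dn} = t \mid \theta_d) = \theta_{dt}, \qquad
p(w_{dn} \mid z_{dn}, \Phi) = \Phi_{z_{dn},\, w_{dn}},
\qquad \text{subject to } \Phi_{tw} \ge 0, \;\; \sum_{w=1}^{V} \Phi_{tw} = 1
$$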
[ Summary ]
Known
- W ( data, the observed words )
Unknown
- Phi ( parameters, distribution over words for each topic )
- Z ( latent variables, topic of each word )
- Theta ( latent variables, distribution over topics for each document )
(2) E-step & M-step Overview
Goal : train the model by finding the optimal values of Phi! ( by maximizing the likelihood of the observed words W )
If we take the logarithm of the likelihood, it looks like this.
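The equation itself did not survive in these notes; a plausible reconstruction of the marginal log-likelihood, using the definitions above, is:

$$
\log p(W \mid \Phi, \alpha)
= \sum_{d=1}^{D} \log \int p(\theta_d \mid \alpha)
\prod_{n=1}^{N_d} \left( \sum_{t=1}^{K} \theta_{dt}\, \Phi_{t,\, w_{dn}} \right) d\theta_d
\;\longrightarrow\; \max_{\Phi}
$$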
Since Z and theta are latent, we will use the EM algorithm to maximize it.
E step : fix Phi and compute ( an approximation of ) the posterior over the latent variables, q(Theta, Z) ≈ p(Theta, Z | W, Phi)
M step : fix q and maximize the expected complete-data log-likelihood E_q[ log p(W, Z, Theta | Phi) ] with respect to Phi
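As a reminder of why this alternation works ( this is standard EM theory, not specific to these notes ), the log-likelihood decomposes into a lower bound plus a KL term:

$$
\log p(W \mid \Phi) = \mathcal{L}(q, \Phi) + \mathrm{KL}\!\left( q(\Theta, Z) \,\|\, p(\Theta, Z \mid W, \Phi) \right)
$$

The E-step maximizes the lower bound over q ( pushing the KL term toward zero ), and the M-step maximizes it over Phi.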
(3) E-step
As a result, we can express q(theta) as shown below.
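The resulting formula is also missing from these notes; the standard mean-field result for LDA ( assuming a factorized q(Theta) q(Z) ) is that q(theta_d) is again a Dirichlet distribution, with updated parameters:

$$
q(\theta_d) = \mathrm{Dirichlet}(\theta_d \mid \gamma_d),
\qquad
\gamma_{dt} = \alpha_t + \sum_{n=1}^{N_d} q(z_{dn} = t)
$$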