5. LDA (Latent Dirichlet Allocation) Model


(1) LDA Model

Goal : find a probabilistic model of a corpus that assigns high probability to members of the corpus ( & to other “similar” documents )

This is what the model looks like ( the plate diagram of LDA ).

[ Interpretation ]

  • $d = 1, \dots, D$ : for each document ( ex. $d = 3$ : document 3 )
  • $\theta_d \sim \mathrm{Dir}(\alpha)$ : generate topic probabilities ( ex. $\theta_d = (0.5, 0.2, 0.3)$ )
  • $n = 1, \dots, N_d$ : for each word ( ex. $n = 4$ : word 4 )
  • $z_{dn} \sim \mathrm{Categorical}(\theta_d)$ : select topic ( ex. with the probability vector $\theta_d$ )
  • $w_{dn} \sim \mathrm{Categorical}(\Phi_{z_{dn}})$ : select word from topic $z_{dn}$

Very Intuitive!
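To make the generative story concrete, here is a minimal sketch of the sampling process in Python ( the sizes, seed, and use of NumPy are made-up assumptions for illustration ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes: D documents, T topics, vocabulary of V words
D, T, V = 3, 3, 10
N = [5, 4, 6]                             # number of words in each document
alpha = np.ones(T)                        # Dirichlet parameter
Phi = rng.dirichlet(np.ones(V), size=T)   # T x V matrix, each row a distribution over words

docs = []
for d in range(D):                        # for each document
    theta = rng.dirichlet(alpha)          # generate topic probabilities, ex. (0.5, 0.2, 0.3)
    words = []
    for n in range(N[d]):                 # for each word
        z = rng.choice(T, p=theta)        # select topic with probabilities theta
        w = rng.choice(V, p=Phi[z])       # select word from topic z
        words.append(w)
    docs.append(words)

print(docs)
```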

Let’s see the distributions more carefully.
The probability of $\theta$ is a Dirichlet distribution with parameter $\alpha$:

$$p(\theta_d \mid \alpha) = \mathrm{Dir}(\theta_d \mid \alpha)$$

The probability of a topic ( given $\theta$ ) is simply the corresponding component of $\theta$:

$$p(z_{dn} = k \mid \theta_d) = \theta_{dk}$$

To select the words, we need to know the probability of each word within a topic. These probabilities can be found in the matrix $\Phi$ ( row index : $z_{dn}$, column index : $w_{dn}$ ):

$$p(w_{dn} \mid z_{dn}, \Phi) = \Phi_{z_{dn},\, w_{dn}}$$

We have to find the matrix $\Phi$ in the expression above. There are two constraints: each row of $\Phi$ is a probability distribution over the vocabulary, so its entries are non-negative and sum to one.

$$\Phi_{tw} \ge 0, \qquad \sum_{w=1}^{V} \Phi_{tw} = 1 \quad \text{for each topic } t$$
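In code, the two constraints just say that every row of $\Phi$ must be a valid probability vector; a quick sanity check ( NumPy assumed, as above ):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.random((3, 10))              # arbitrary non-negative T x V matrix
Phi /= Phi.sum(axis=1, keepdims=True)  # normalize each row so it sums to one

assert np.all(Phi >= 0)                   # constraint 1: non-negative entries
assert np.allclose(Phi.sum(axis=1), 1.0)  # constraint 2: rows sum to one
```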


[ Summary ]

Known

  • $W$ ( data, the observed words )

Unknown

  • $\Phi$ ( parameters, distribution over words for each topic )
  • $Z$ ( latent variables, topic of each word )
  • $\Theta$ ( latent variables, distribution over topics for each document )

(2) E-step & M-step Overview

Goal : train the model by finding the optimal values of $\Phi$! ( by maximizing the likelihood )

If we take the logarithm of the likelihood, it looks like this:

$$\log p(W \mid \Phi, \alpha) = \sum_{d=1}^{D} \log \int \prod_{k=1}^{T} \theta_{dk}^{\alpha_k - 1} \prod_{n=1}^{N_d} \sum_{k=1}^{T} \theta_{dk}\, \Phi_{k,\, w_{dn}} \; d\theta_d \;\rightarrow\; \max_{\Phi}$$

( erasing the constant : the Dirichlet normalizer does not depend on $\Phi$ )

We will use the EM algorithm to maximize this, since the integral inside the logarithm makes direct optimization intractable.

E step : with $\Phi$ fixed, approximate the posterior over the latent variables by a factorized ( mean-field ) distribution

$$q(\Theta, Z) = q(\Theta)\, q(Z) = \operatorname*{argmin}_{q} \mathrm{KL}\big[\, q(\Theta)\, q(Z) \,\big\|\, p(\Theta, Z \mid W, \Phi) \,\big]$$

M step : with $q$ fixed, maximize the expected complete-data log-likelihood over $\Phi$

$$\Phi = \operatorname*{argmax}_{\Phi} \mathbb{E}_{q(\Theta)\, q(Z)} \log p(W, Z, \Theta \mid \Phi, \alpha)$$
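Putting the two steps together, here is a minimal sketch of the full training loop in Python ( NumPy/SciPy assumed; the function name `lda_em`, the initialization, and all default values are my own choices, not from the lecture ):

```python
import numpy as np
from scipy.special import digamma

def lda_em(docs, T, V, alpha=1.0, n_iter=50, inner_iter=20, seed=0):
    """Variational EM for LDA, treating Phi as a point-estimated parameter.

    docs : list of documents, each a list of word ids in {0, ..., V-1}
    T    : number of topics;  V : vocabulary size
    """
    rng = np.random.default_rng(seed)
    Phi = rng.dirichlet(np.ones(V), size=T)          # T x V, rows sum to one
    alpha = np.full(T, alpha)                        # Dirichlet prior parameters

    for _ in range(n_iter):
        # ---- E-step: update q(Z) and q(Theta) for each document ----
        counts = np.zeros((T, V))                    # expected topic-word counts
        for words in docs:
            words = np.asarray(words)
            gamma = alpha + len(words) / T           # init params of q(theta_d)
            for _ in range(inner_iter):
                # q(z_dn = k) proportional to Phi[k, w_dn] * exp(E[log theta_dk])
                log_theta = digamma(gamma) - digamma(gamma.sum())
                r = Phi[:, words] * np.exp(log_theta)[:, None]   # T x N_d
                r /= r.sum(axis=0, keepdims=True)
                gamma = alpha + r.sum(axis=1)        # gamma_dk = alpha_k + sum_n q(z_dn=k)
            np.add.at(counts, (slice(None), words), r)  # accumulate responsibilities
        # ---- M-step: Phi_{kw} proportional to expected count of word w in topic k ----
        Phi = (counts + 1e-12) / (counts + 1e-12).sum(axis=1, keepdims=True)
    return Phi
```

For example, `lda_em([[0, 1, 2, 1], [3, 4, 3]], T=2, V=5)` returns a $2 \times 5$ topic-word matrix $\Phi$.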



(3) E-step


With $\Phi$ fixed, the mean-field update for $q(\Theta)$ is

$$\log q(\Theta) = \mathbb{E}_{q(Z)} \log p(W, Z, \Theta \mid \Phi, \alpha) + \text{const} = \sum_{d=1}^{D} \sum_{k=1}^{T} ( \gamma_{dk} - 1 ) \log \theta_{dk} + \text{const}$$

where

$$\gamma_{dk} = \alpha_k + \sum_{n=1}^{N_d} q(z_{dn} = k)$$

As a result, we can express $q(\Theta)$ like below:

$$q(\Theta) = \prod_{d=1}^{D} \mathrm{Dir}( \theta_d \mid \gamma_d )$$
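Numerically, the update for $q(\theta_d)$ is one line once the responsibilities $q(z_{dn} = k)$ are known ( `r` below is a hypothetical $T \times N_d$ array of responsibilities; NumPy assumed ):

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])                    # Dirichlet prior (T = 3)
r = np.array([[0.7, 0.1], [0.2, 0.6], [0.1, 0.3]])   # q(z_dn = k) for a 2-word document

gamma = alpha + r.sum(axis=1)   # gamma_dk = alpha_k + sum_n q(z_dn = k)
print(gamma)                    # parameters of q(theta_d) = Dir(gamma)
```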