4. LDA (Latent Dirichlet Allocation) Intro

1) Introduction

a. Topic Modeling


https://miro.medium.com/max/2796/1*jpytbqadO3FtdIyOjx2_yg.png

A topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovering hidden semantic structures in a text body. (Wikipedia)

Topic modeling treats a document as a ‘distribution’ over topics. For example, take the famous book ‘The Adventures of Sherlock Holmes’. We can say that this book is composed of three topics: (60%) detective + (30%) adventure + (10%) horror, which we can write as the vector (0.6, 0.3, 0.1). Then if another book called “Sherlock Holmes and his friends” has the vector (0.5, 0.2, 0.3), it means that this book is composed of (50%) detective + (20%) adventure + (30%) horror.

b. Similarity & Distance

After we have found the topic distribution of some books or documents (as vectors), we can measure how similar or different two books are in terms of topic. The commonly used measures of similarity and distance are ‘Euclidean distance’ and ‘Cosine similarity’. I will not cover them in detail, as you might all know them.

[ Euclidean Distance ]

$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$

[ Cosine Similarity ]

$\text{similarity} = \cos\theta = \dfrac{p \cdot q}{\lVert p \rVert \, \lVert q \rVert} = \dfrac{\sum_{i=1}^{n} p_i q_i}{\sqrt{\sum_{i=1}^{n} p_i^2} \, \sqrt{\sum_{i=1}^{n} q_i^2}}$

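Both measures can be sketched in a few lines of plain Python, using the two Sherlock Holmes topic vectors from the example above:

```python
import math

# Topic vectors (detective, adventure, horror) from the example above.
book1 = (0.6, 0.3, 0.1)  # 'The Adventures of Sherlock Holmes'
book2 = (0.5, 0.2, 0.3)  # "Sherlock Holmes and his friends"

def euclidean(p, q):
    """Euclidean distance: straight-line distance between the two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine(p, q):
    """Cosine similarity: cosine of the angle between the two vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

print(euclidean(book1, book2))  # ≈ 0.245 (small distance: similar topic mix)
print(cosine(book1, book2))     # ≈ 0.933 (close to 1: similar direction)
```

Note that Euclidean distance is 0 for identical vectors and grows as they differ, while cosine similarity is 1 for vectors pointing in the same direction.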

c. Dirichlet Distribution

[ Intro ]

It is a distribution over a vector theta, and is parameterized by a parameter alpha (which is also a vector). It has the form

$\mathrm{Dir}(\theta \mid \alpha) = \dfrac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$

The constraints are as below:

$\theta_k \ge 0, \quad \sum_{k=1}^{K} \theta_k = 1, \quad \alpha_k > 0$

One easy way to interpret this is as a triangle (a simplex), like below (in the case of a 3-dimensional vector).


https://miro.medium.com/max/1163/1*Pepqn_v-WZC9iJXtyA-tQQ.png
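The density above can be evaluated directly with the standard library. This is a minimal sketch; the `alpha` and `theta` values are assumed examples, not from the original text:

```python
import math

def dirichlet_pdf(theta, alpha):
    """Density of Dir(alpha) evaluated at a point theta on the simplex."""
    norm = math.gamma(sum(alpha)) / math.prod(math.gamma(a) for a in alpha)
    return norm * math.prod(t ** (a - 1) for t, a in zip(theta, alpha))

alpha = [2.0, 3.0, 5.0]
theta = [0.2, 0.3, 0.5]   # theta_k >= 0 and the entries sum to 1

print(dirichlet_pdf(theta, alpha))  # ≈ 8.505
```

Larger `alpha_k` values pull the density toward the k-th corner of the simplex; `alpha = (1, 1, 1)` gives the uniform distribution over the triangle.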

[ Statistics ]

  • Mean: $E[\theta_k] = \dfrac{\alpha_k}{\alpha_0}$, where $\alpha_0 = \sum_{k=1}^{K} \alpha_k$

  • Covariance: $\mathrm{Var}(\theta_k) = \dfrac{\alpha_k (\alpha_0 - \alpha_k)}{\alpha_0^2 (\alpha_0 + 1)}$ and $\mathrm{Cov}(\theta_i, \theta_j) = \dfrac{-\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0 + 1)}$ for $i \neq j$

  • Note! When $K = 2$, it is the same as the ‘beta distribution’

  • The Dirichlet prior is conjugate to the ‘multinomial likelihood’: if $\theta \sim \mathrm{Dir}(\alpha)$ and we observe counts $n = (n_1, \dots, n_K)$, the posterior is simply $\mathrm{Dir}(\alpha + n)$
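The statistics above can be checked empirically by sampling. This is a quick sketch with NumPy; the `alpha` vector is an assumed example, not from the original text:

```python
import numpy as np

# Assumed example parameter vector.
alpha = np.array([2.0, 3.0, 5.0])
alpha_0 = alpha.sum()

rng = np.random.default_rng(42)
samples = rng.dirichlet(alpha, size=200_000)   # shape (200000, 3)

# Every draw lies on the simplex: non-negative entries summing to 1.
print(np.all(samples >= 0), samples.sum(axis=1)[:3])

# Empirical mean should approach alpha_k / alpha_0 = [0.2, 0.3, 0.5].
print(samples.mean(axis=0))

# Empirical variance should approach alpha_k (alpha_0 - alpha_k) / (alpha_0^2 (alpha_0 + 1)).
print(samples.var(axis=0))
print(alpha * (alpha_0 - alpha) / (alpha_0 ** 2 * (alpha_0 + 1)))
```

With 200,000 draws, the empirical mean and variance match the closed-form expressions to two or three decimal places.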