Topic modeling using Latent Dirichlet
Allocation

Atta Muhammad
3 min read · Mar 29, 2022

Natural Language Processing (NLP) is a branch of Artificial Intelligence that has contributed a lot and solved many business problems. It is the domain that enables computers to understand human language. NLP is booming thanks to improved access to data and increases in computational power, which are allowing practitioners to achieve meaningful results in areas like healthcare, media, finance, and human resources, among others. In this post I would like to share my knowledge of Latent Dirichlet Allocation.

Topic modeling is the process of identifying patterns in text data that correspond to topics. If a text contains multiple topics, this technique can be used to identify and separate those themes, uncovering the hidden thematic structure in a given set of documents. Topic modeling helps us organize documents in a way that supports further analysis. One thing to note about topic modeling algorithms is that they don't need labeled data: like other unsupervised learning methods, they identify the patterns on their own. Given the enormous volume of text generated on the internet, topic modeling is important because it enables the summarization of vast amounts of data that would otherwise be impractical to process.

Latent Dirichlet Allocation (LDA) is a topic modeling technique whose underlying concept is that a given piece of text is a combination of multiple topics. Consider the sentence: "Data visualization is an important tool in financial analysis." This sentence touches on several topics, such as data, visualization, and finance, and it is this combination of topics that characterizes a document. LDA is a statistical model that assumes documents are generated by a random process based on such topics, where a topic is a distribution over a fixed vocabulary of words. Let's see how to do topic modeling in Python.
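Before turning to the real dataset, the generative assumption above can be sketched with a toy example (all names and sizes here are hypothetical, just for illustration): each document draws a topic mixture from a Dirichlet distribution, each topic is a probability distribution over the vocabulary, and each word is generated by first picking a topic and then a word from that topic.

```python
import numpy as np

rng = np.random.default_rng(123)

# Toy sizes, not tied to the dataset used later in this post.
n_topics, vocab_size = 3, 8

# A document's topic mixture: 3 non-negative weights that sum to 1.
doc_topic_mix = rng.dirichlet(alpha=[0.5] * n_topics)

# Each topic is a distribution over the vocabulary: shape (3, 8),
# with every row summing to 1.
topic_word_dists = rng.dirichlet(alpha=[0.5] * vocab_size, size=n_topics)

# Generate one word: pick a topic, then pick a word from that topic.
topic = rng.choice(n_topics, p=doc_topic_mix)
word_id = rng.choice(vocab_size, p=topic_word_dists[topic])
```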

import pandas as pd

# movie_data.csv is assumed to hold the movie reviews in a 'review' column
df = pd.read_csv('movie_data.csv', encoding='utf-8')

Next, we are going to use the already familiar CountVectorizer to create the bag-of-words matrix as input to the LDA. For convenience, we will use scikit-learn's built-in English stop word list via stop_words='english':

from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english',
                        max_df=.1,
                        max_features=5000)
X = count.fit_transform(df['review'].values)

We set the maximum document frequency of words to be considered to 10 percent (max_df=.1) to exclude words that occur too frequently across documents. The rationale behind removing frequently occurring words is that these may be common words appearing across all documents, which are therefore less likely to be associated with a specific topic category of a given document. We also limited the number of words to be considered to the 5,000 most frequent (max_features=5000), to limit the dimensionality of the dataset and improve the inference performed by LDA.
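The effect of max_df can be seen on a hypothetical mini-corpus (the example documents below are made up): a word that appears in every document survives the default settings but is dropped once its document frequency exceeds the threshold.

```python
from sklearn.feature_extraction.text import CountVectorizer

# "movie" appears in all three documents; "great" in only one.
docs = ['great movie with a great cast',
        'boring movie and a weak plot',
        'movie night was fun']

cv_all = CountVectorizer()             # no document-frequency cap
cv_capped = CountVectorizer(max_df=0.5)  # drop words in >50% of documents
cv_all.fit(docs)
cv_capped.fit(docs)

print('movie' in cv_all.vocabulary_)     # kept without a cap
print('movie' in cv_capped.vocabulary_)  # filtered out by max_df
```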

from sklearn.decomposition import LatentDirichletAllocation

# n_components was called n_topics in older scikit-learn versions
lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')
X_topics = lda.fit_transform(X)

After fitting the LDA, we have access to the components_ attribute of the lda instance, which stores a matrix with the word importances (one value per word in the 5,000-word vocabulary) for each of the 10 topics:

lda.components_.shape  # (10, 5000)
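The rows of components_ are unnormalized pseudo-counts rather than probabilities; dividing each row by its sum turns a topic into a proper distribution over the vocabulary. A minimal sketch with synthetic values (not taken from the fitted model above):

```python
import numpy as np

# Two hypothetical topics over a three-word vocabulary.
components_demo = np.array([[2.0, 6.0, 2.0],
                            [1.0, 1.0, 8.0]])

# Normalize each row so it sums to 1, giving per-topic word probabilities.
topic_word_probs = components_demo / components_demo.sum(axis=1, keepdims=True)
print(topic_word_probs[0])  # [0.2 0.6 0.2]
```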

Let's analyse the result by printing the five most important words for each of the 10 topics:

n_top_words = 5

# get_feature_names_out replaces get_feature_names in newer scikit-learn
feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (topic_idx + 1))
    # argsort is ascending, so take the last n_top_words indices in reverse
    print(" ".join([feature_names[i]
                    for i in topic.argsort()[:-n_top_words - 1:-1]]))

Results:

Topic 1:
worst minutes awful script stupid
Topic 2:
family mother father children girl
Topic 3:
american war dvd music tv
Topic 4:
human audience cinema art sense
Topic 5:
police guy car dead murder
Topic 6:
horror house sex girl woman
Topic 7:
role performance comedy actor performances
Topic 8:
series episode war episodes tv
Topic 9:
book version original read novel
Topic 10:
action fight guy guys cool
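The X_topics matrix returned by fit_transform can also be used the other way around: each row is a document's topic mixture, so sorting a column ranks the reviews most strongly associated with that topic (for instance, the horror-like Topic 6 above would be column index 5). A runnable sketch with a tiny synthetic stand-in for X_topics:

```python
import numpy as np

# Synthetic document-topic matrix: 3 documents, 2 topics.
# On the real data this would be X_topics with shape (n_documents, 10).
X_topics_demo = np.array([[0.1, 0.9],
                          [0.8, 0.2],
                          [0.5, 0.5]])

# Rank documents by their weight on topic 0, strongest first.
top_for_topic_0 = X_topics_demo[:, 0].argsort()[::-1]
print(top_for_topic_0)  # [1 2 0]
```

On the actual dataset, something like df['review'][top_for_topic_0[0]] would then print the review most representative of that topic.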

Thanks for reading my blog. Please share it, follow me, and leave your feedback in the comments.
