An LDA based Topic Modelling Approach to Understanding Consumer Queries from Unstructured Chat Texts
Background
In today’s world, customer satisfaction is a primary goal for most companies. Companies employ various technical and non-technical strategies to keep customers engaged and to address their concerns promptly. In a post-COVID world with greater dependence on remote working, technology use has spiked, and with it the number of technology-related issues that need to be addressed. Chatbots have been used extensively for automation, but they can be trained to do only so much. Human intervention is still needed, and it is costly and time-consuming. There is therefore a need for a tool that can reduce this manual intervention to some extent.
In this article, we propose an automated methodology based on Latent Dirichlet Allocation (LDA) topic modeling to understand customer queries and gain insight from them so that they can be easily segmented into categories. The technique not only identifies key areas of concern but also allocates a topic to each customer query. The methodology is not limited to consumer queries: it can also be extended to YouTube or other social media comments to gain insight into what users think about a particular video or post.
Problem Statement
The problem at hand is to analyze unstructured data in the form of consumer queries and identify the relevant topic of concern from the unstructured text. Once the key areas are identified, each query is labeled with a topic and returned for easy segmentation in downstream processing.
Technical Challenges in the Problem
The raw data could have the following problems:
- Spelling errors
- Occurrence of high-frequency words which are related to the domain of business. These words could overshadow the actual areas of concern.
- Presence of commonplace words which have no value with respect to the topic of concern
- Multiple aliases of the same word
- Usage of special characters
Initial Approaches
We first tried a couple of approaches that proved unsuccessful in this case but helped us find the right method. We discuss these preliminary approaches first because they explain why the final method was chosen.
Approach 1
First, we ran a simple LDA model on the corpus using the scikit-learn library, after cleaning the data with some basic preprocessing steps. The results from this approach were not satisfactory: the topics obtained overlapped. Fig. 1 shows the flow chart for this approach, and the initial results are presented in Fig. 2.


Approach 2
In a revised approach, we looked at words occurring together instead of analyzing words individually, that is, at ngrams instead of unigrams. Our reasoning was that ngrams would carry more of the context, so we used the ngram_range parameter of CountVectorizer to extract all the bigrams and trigrams in the corpus and ran the LDA model on this modified corpus. However, the results were still not satisfactory. Fig. 3 gives the flow chart for this approach, and Fig. 4 shows the results from this method.


Final Approach
We then decided to first identify only the important ngrams, based on a metric called the Pointwise Mutual Information (PMI) score, and then run the LDA model on these ngrams. This approach gave promising results, so we finalized it. Details of this approach are given below.
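For reference, PMI for a trigram is commonly defined as follows. That this is the exact variant used in this work is an assumption; the article does not spell out the formula.

```latex
\mathrm{PMI}(w_1, w_2, w_3) = \log_2 \frac{P(w_1, w_2, w_3)}{P(w_1)\,P(w_2)\,P(w_3)}
```

Here P(w_1, w_2, w_3) is the relative frequency of the trigram in the corpus and P(w_i) is the relative frequency of each individual word. A high score means the three words appear together far more often than they would if they occurred independently.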
Solution Methodology
We follow a three-step approach to solve the problem. Firstly, we perform routine text preprocessing on the corpus, like tokenization, lemmatization, special character removal, etc. We also create a set of commonly occurring words that are specific to the domain (in our case, flights, fly, station, city names, ticket, etc.) and remove them so that they don’t overshadow the underlying topics.
Secondly, we identify important trigrams in the corpus. Using the collocations module in nltk and the PMI score, we identify trigrams that occur frequently. These trigrams are then embedded in the corpus: wherever they occur in a query, we group them together into one word using underscores.
Lastly, the LDA model was applied to this augmented corpus, and topics were generated. We looked at the top 10 words for each topic to get an idea about what each topic is about. Then, each query in the corpus was assigned a topic based on the probabilities assigned for each topic.
Solution Details
Now we discuss the detailed workflow of the solution methodology and the various steps involved. Fig. 5 is a schematic of this process.

Step 1: Text Preprocessing
We start by cleaning the text data. We perform routine tokenization, remove special characters, remove words shorter than 3 letters, etc. We manually curate a list of domain-specific common words and treat them as stop words so that they don’t overshadow the underlying topics. Lastly, the words are lemmatized using the WordNet lemmatizer and a part-of-speech tagger.
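The cleaning steps above could be sketched roughly as follows. This is a minimal, standard-library-only illustration: the stop-word sets and the function name preprocess are hypothetical, and the WordNet lemmatization step is left out because it depends on downloading additional nltk data.

```python
import re

# Hypothetical domain-specific words; the article curates such a list by hand.
DOMAIN_STOP_WORDS = {"flight", "flights", "fly", "station", "ticket"}
# Small illustrative set of commonplace words (a real list would be much longer).
COMMON_STOP_WORDS = {"the", "and", "but", "for", "was", "with", "that", "this"}


def preprocess(query, min_length=3):
    """Lowercase, strip special characters, tokenize, and filter one query."""
    text = query.lower()
    # Replace anything that is not a letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [
        token
        for token in tokens
        if len(token) >= min_length
        and token not in DOMAIN_STOP_WORDS
        and token not in COMMON_STOP_WORDS
    ]


print(preprocess("My flight ticket was refunded, but the claim is REFUSED!!"))
# prints ['refunded', 'claim', 'refused']
```

Removing the domain words before modeling keeps terms such as "flight" and "ticket" from dominating every topic.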
Step 2: Identification of important trigrams
Next, we identify important trigrams using the nltk collocations module and the PMI score. A frequency filter was also applied to keep only the frequently occurring trigrams. Identifying these trigrams also helped us identify other domain-specific words and phrases such as city names (Las Vegas, San Jose, Salt Lake City, New Mexico, etc.). The obtained trigrams were then embedded into the corpus, which means that if a query contained any of these trigrams, we joined its words into one word using underscores. Figs. 6 and 7 show the relevant code snippets, and Fig. 8 shows the output from this step.
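The original snippets are not reproduced here, so the following is only a sketch of how this step might look with nltk's TrigramCollocationFinder. The sample data, the function names find_top_trigrams and merge_trigrams, and the values of min_freq and top_n are all assumptions.

```python
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures

# Hypothetical tokenized queries (the output of the preprocessing step).
tokenized_queries = [
    ["delayed", "salt", "lake", "city", "connection"],
    ["refund", "salt", "lake", "city", "cancelled"],
    ["lost", "baggage", "salt", "lake", "city"],
    ["password", "reset", "link", "expired"],
]


def find_top_trigrams(documents, min_freq=2, top_n=50):
    """Rank trigrams by PMI, keeping only those seen at least min_freq times."""
    measures = TrigramAssocMeasures()
    finder = TrigramCollocationFinder.from_documents(documents)
    finder.apply_freq_filter(min_freq)  # drop rare trigrams before scoring
    return finder.nbest(measures.pmi, top_n)


def merge_trigrams(tokens, trigrams):
    """Join every occurrence of a selected trigram into one underscore token."""
    selected = set(trigrams)
    merged, i = [], 0
    while i < len(tokens):
        window = tuple(tokens[i:i + 3])
        if window in selected:
            merged.append("_".join(window))
            i += 3  # skip past the three words that were just merged
        else:
            merged.append(tokens[i])
            i += 1
    return merged


top_trigrams = find_top_trigrams(tokenized_queries)
augmented_queries = [merge_trigrams(q, top_trigrams) for q in tokenized_queries]
print(top_trigrams)
print(augmented_queries[0])
```

The frequency filter matters because PMI on its own tends to favor word combinations that appear only once or twice.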



Step 3: Fitting the LDA model
After augmenting the corpus with the important trigrams, we ran the LDA model on it. First, we vectorized the corpus using scikit-learn's CountVectorizer. We set its ngram_range parameter so that only trigrams in the corpus were considered, a choice made after testing different values for ngram_range. Then we used the scikit-learn LDA model to generate topics for the corpus. The top words for each topic were identified to get an idea of what the topic is about. We then visualized the topics using pyLDAvis. Fig. 9 is the code snippet for fitting the LDA model.

Results and Visualization of topics
Given below is the list of keywords for all the topics obtained. One can go through these keywords to understand the underlying themes and areas of concern. Many areas of concern and themes were identified, for example, long-haul flights, refused claims, password resetting issues, credit card issues, slow internet, and so on. Refer to the image below to see how each document is assigned a topic. Fig. 10 depicts the list of keywords obtained using this approach.
Further, we visualized the topics obtained to get a sense of their frequency and the overlap among them. We used a library called ‘pyLDAvis’, which made it easy to visualize these topics with very few lines of code. The visualization shows that there is very little overlap between the topics and that they are well spread out, indicating good results. This is depicted in Fig. 11. The final output is shown in Fig. 12.



Future Work
There is a lot of scope for future work on this topic. Here are some areas that future researchers can work on:
- There is some overlap in the topics obtained. Future researchers can work to reduce this overlap and aim for mutually exclusive and exhaustive topics.
- Some documents get equal probability for every topic, and hence the first topic is assigned to them. Future researchers can explore techniques to deal with this.
- Deciding the number of topics to choose is a key step here. Researchers could devise a method to choose the optimal number of topics for the given corpus.
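To make the equal-probability issue concrete, the sketch below shows one possible way to flag ambiguous queries instead of defaulting to the first topic. This is not part of the method described in this article: the function name assign_topic and the min_margin threshold are hypothetical.

```python
def assign_topic(probabilities, min_margin=0.05):
    """Return the index of the most probable topic, or None if the call is ambiguous.

    min_margin is a hypothetical threshold: the top probability must beat the
    runner-up by at least this much for the assignment to count.
    """
    ranked = sorted(
        range(len(probabilities)), key=lambda i: probabilities[i], reverse=True
    )
    best = ranked[0]
    if len(ranked) > 1:
        runner_up = ranked[1]
        if probabilities[best] - probabilities[runner_up] < min_margin:
            return None  # flag for review instead of defaulting to topic 0
    return best


print(assign_topic([0.70, 0.20, 0.10]))  # clear winner
print(assign_topic([1 / 3, 1 / 3, 1 / 3]))  # uniform distribution, flagged as ambiguous
```

Queries that come back as None could be routed to a catch-all category or to manual review.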
Summary
In this project, we set out to understand consumer queries using an LDA-based approach. After trying different approaches and methods, we arrived at a technique that identifies the topics of concern underlying the text data. This work can be extended to other areas, such as understanding YouTube comments or any other type of user feedback.
Author Biographies
Gurjote Singh
https://www.linkedin.com/in/gurjote27
Gurjote is a final year mathematics undergraduate at the University of Delhi and a Data Science student at IIT Madras. His interests lie in NLP and traditional machine learning. Gurjote is also one of the 35 Tableau Student Ambassadors chosen from around the world and has helped 200+ students get started with Tableau. Since 2019, he has been involved in multiple projects and internships in the data science space.
Krishnamurthy Narayanaswamy
https://www.linkedin.com/in/krishnamurthy-narayanaswamy-3675ab17
Krishna is currently a program manager with HP. He is an experienced business analyst with a demonstrated history of working in the information technology and services industry. He is skilled in operations management, Power BI, sentiment analytics, analytical skills, requirements analysis, customer relationship management (CRM), and Six Sigma. He is a strong business development professional with a Master of Business Administration (M.B.A.) focused on Finance from S.I.E.S. College of Management Studies and a Post Graduate Diploma in Business Analytics from XLRI.
Dr. Anish Roychowdhury
https://www.linkedin.com/in/anish-roychowdhury-ph-d-a2463714/
Dr. Anish Roy Chowdhury is an industry data science leader, currently leading the data science space for manufacturing at Dr. Reddy’s Labs, and an adjunct faculty member with SP Jain. In previous roles, he led the data science team of a FinTech startup and, before that, was with ABInBev as a data science research lead working in areas such as assortment optimization and reinforcement learning. He also led several machine learning projects in credit risk, logistics, and sales forecasting. In his stint with HP Supply Chain Analytics, he developed data quality solutions for logistics projects and built statistical models to predict spare part demand for large-format printers. Prior to HP, he spent 6 years in the IT sector as a database programmer, where he worked on credit card fraud detection among other analytics-related projects. He has a Ph.D. in Mechanical Engineering from IISc Bangalore and an MS degree in Mechanical Engineering from Louisiana State University, USA. He did his undergraduate studies at NIT Durgapur, with published research in GA-fuzzy logic applications to medical diagnostics.
An LDA based Topic Modelling Approach to Understanding Consumer Queries from Unstructured Chat Texts - Analytics Insight