Using Machine Learning to Categorize Texts into Topics

Develop an unsupervised learning algorithm to find subject commonalities

Henry Alpert
Towards Data Science


After reading a news article — whether the subject matter is U.S. politics, a movie review, or a productivity tip — you can turn to someone else and give them a general idea of what it’s about, right? Or if you read a novel, you can classify it as maybe sci-fi, literary fiction, or a romance.

Humans tend to be pretty good at classifying texts. And these days, computers can do it, too.

For a recent machine learning project, I downloaded consumer complaints from the Consumer Financial Protection Bureau and developed models to classify the complaints into one of five product categories. My top-performing model did so correctly 86% of the time. (You can read about that project in this Medium blog.)

I used a supervised learning technique, in which models learn from labeled examples and then predict categories for unlabeled ones. In the dataset used to train my models, the product categories were chosen by the consumers themselves. Considering that most consumers aren’t financial experts, they likely didn’t label every complaint perfectly, and those errors would have impacted the models’ performance.

Or think about a different dataset where the texts weren’t pre-labeled. It would be tedious for humans to label them just to train a supervised Natural Language Processing (NLP) model. These downsides are inherent in supervised models.

As a follow-up to the project, I wanted to develop an unsupervised NLP model to see what categories would arise. This model would ignore the pre-labeled categories and instead discern commonalities in order to group the texts into topics of its own devising.

I also had a business case in mind. An unsupervised model might be useful to the Consumer Financial Protection Bureau in two ways: the bureau wouldn’t have to rely on consumers to classify their own submissions, and it could potentially sort incoming messages into new, unforeseen categories. In essence, an unsupervised LDA model could help any institution that receives messages from consumers or clients classify those texts.

Creating New Topics

I had already processed the complaints for the supervised learning models by tokenizing the texts, removing stopwords, and lemmatizing the words. (More details are in the aforementioned blog.) For the unsupervised modeling, I then used Gensim’s Latent Dirichlet Allocation (LDA) module.
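Here’s a minimal sketch of that step, assuming the cleaned complaints live in a list of token lists called docs (the variable names are mine, not from the original project):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Map each unique token to an integer id
dictionary = Dictionary(docs)

# Convert each complaint into a bag-of-words representation
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit an LDA model that assumes five latent topics
lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=5,
    passes=10,        # passes over the corpus; tune as needed
    random_state=42,  # for reproducible topics
)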

I had it divide the 160,000 complaints into five topics and then used the pyLDAvis module to create a visualization:

To see an interactive page of these topic visualizations, click here.
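Generating the interactive page takes only a couple of lines. This sketch reuses lda, corpus, and dictionary from the snippet above; note that recent versions of pyLDAvis expose the Gensim bridge as pyLDAvis.gensim_models (older versions used pyLDAvis.gensim):

import pyLDAvis
import pyLDAvis.gensim_models

# Prepare the topic data and write it out as an interactive HTML page
vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")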

The visualization graphed the most common words in each of the five topics (a short snippet for pulling these keywords directly from the model follows the list):

Topic 1

This topic includes general financial words like account, bank, card, money, and credit but also words like call, email, phone, letter, and received. The common thread here seems to be communications.

Topic 2

With top words like credit, report, reporting, inquiry, and bureau, this topic is about credit reporting.

Topic 3

Payment, loan, mortgage, due, interest, and balance tell me this topic concerns issues related to mortgages and loans.

Topic 4

Top words in this topic include debt, collection, law, violation, proof, and legal. It seems this topic concerns legal matters.

Topic 5

Here, some keywords are information, consumer, identity, theft, investigation, fraudulent, fair, and victim. I’d say this topic concerns identity theft or investigations into fraudulent activity.
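As referenced above, here’s one way to print the top keywords for each topic straight from the fitted model (weights omitted for brevity):

# List the ten highest-probability words in each of the five topics
for topic_id in range(lda.num_topics):
    top_words = [word for word, _ in lda.show_topic(topic_id, topn=10)]
    print(f"Topic {topic_id + 1}: {', '.join(top_words)}")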

Before I get into further discussion of these groupings, I’d like to point out that I could have chosen more or fewer than five topics. In fact, Gensim also has a module to measure the topic “coherence” of a model. I built models with different numbers of topics and compared them with the u_mass coherence measure. That measure yields a negative number; the closer it is to zero, the more coherent the model.
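A sketch of that comparison, again reusing corpus and dictionary from earlier (the range of topic counts is illustrative, not the exact one I used):

from gensim.models import CoherenceModel, LdaModel

for k in range(3, 9):  # candidate topic counts
    model = LdaModel(corpus=corpus, id2word=dictionary,
                     num_topics=k, passes=10, random_state=42)
    # u_mass coherence works from the bag-of-words corpus alone
    cm = CoherenceModel(model=model, corpus=corpus,
                        dictionary=dictionary, coherence="u_mass")
    print(f"{k} topics: u_mass coherence = {cm.get_coherence():.3f}")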

As you can see, the model with five topics had the highest coherence score. At the same time, after I looked into coherence as a way to grade my models, it seems that many data scientists who focus on NLP don’t put much stock in it. LDA is an unsupervised technique, after all, so a single quantitative measurement doesn’t necessarily supersede a nuanced human interpretation of the results.

How Useful is the Unsupervised Model?

Let’s compare the five product categories of the supervised learning model with the five topic categories of my unsupervised learning model.

Two pairs of the topics overlap, but six topics are different. How can these findings be insightful or useful?

I can imagine the Consumer Financial Protection Bureau having an area that handles communications and consumer-service issues, one that handles legal matters, and another that handles identity theft and fraud. If so, complaints that the LDA model placed in one of these respective topics could be routed to the appropriate area.

In conclusion, an unsupervised model was indeed able to categorize incoming messages into new, unforeseen categories. These five categories were based on my intuitive reading of the top keywords and are certainly subject to deeper analysis and interpretation. Still, the larger point is that both supervised and unsupervised NLP models could be useful to any organization looking to process a large volume of incoming messages daily.
