Created on 11-21-2018 08:17 PM - edited 08-17-2019 05:39 AM
Naïve Bayes is a machine learning model that is simple and computationally light, yet accurate at classifying text compared with more complex ML models. In this article I use the Python scikit-learn library to develop the model. It is developed in a Zeppelin notebook on top of the Hortonworks Data Platform (HDP) and uses the %spark2.pyspark interpreter to run Python on top of Spark.
We will use a news feed to train the model to classify text. We will first build the basic model, then explore its data and attempt to improve the model. Finally, we will compare the accuracy of all the models we develop.
A note on code structuring: Python import statements are introduced when required by the code and not all at once upfront. This is to relate the packages closely to the code itself.
The Zeppelin template for the full notebook can be obtained from here.
We will use news feed data in which each item has been classified as 1 (World), 2 (Sports), 3 (Business) or 4 (Sci/Tech). The data is structured as CSV with fields: class, title, summary. (Note that later processing converts the labels to 0, 1, 2, 3 respectively.)
The class, title and summary fields are appended to their own arrays. Data cleansing is done in the form of punctuation removal and conversion to lowercase.
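The loading and cleansing step described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's actual code: the two sample rows, the `clean` helper and the `load_feed` function are all hypothetical names and data, and the label shift to 0..3 is applied here only for convenience.

```python
import csv
import string

# Hypothetical sample rows standing in for the real news feed file
# (fields: class, title, summary).
SAMPLE_ROWS = [
    '"3","Bank Shares Rise!","Shares in the bank rose 5% after strong earnings."',
    '"2","Team Wins Final","The home side won the cup final on Sunday."',
]

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def clean(text):
    """Remove ASCII punctuation and convert the text to lowercase."""
    return text.translate(_STRIP_PUNCT).lower()


def load_feed(lines):
    """Append class, title and summary to their own lists, cleaning the text."""
    labels, titles, summaries = [], [], []
    for label, title, summary in csv.reader(lines):
        labels.append(int(label) - 1)  # shift labels 1..4 to 0..3
        titles.append(clean(title))
        summaries.append(clean(summary))
    return labels, titles, summaries


labels, titles, summaries = load_feed(SAMPLE_ROWS)
print(labels)
print(titles)
```

To read from a real file, an open file handle can be passed to `load_feed` in place of the list of lines.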
For use with the scikit-learn packages, these arrays are represented as dataframes, which assign an integer index to each row of the array inside the data structure.
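As a small sketch of the dataframe representation, assuming pandas is the dataframe library in use (the source does not name it) and using made-up labels and summaries:

```python
import pandas as pd

# Hypothetical cleaned data; in the article these come from the news feed.
labels = [2, 1, 3]
summaries = [
    "shares in the bank rose after strong earnings",
    "the home side won the cup final on sunday",
    "researchers unveil a faster chip design",
]

# Each row receives an integer index (0, 1, 2, ...) inside the dataframe.
df = pd.DataFrame({"label": labels, "summary": summaries})
print(df)
print(df.loc[1, "summary"])  # look up a row by its integer index
```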
Note: This first model will use summaries to classify text, as shown in the code.
Now we start using the machine learning packages. Here we convert the dataframes into a vector. The vector is a wide n x m matrix with n records and, for each record, m fields: one position for each word detected among all records, holding the frequency of that word in that record. This is a sparse matrix, since most of the m positions are not filled for any given record. We see from the output that there are 7600 records and 20027 words. The vector shown in the output is partial, showing part of the 0th record with word index positions 11624, 6794, 6996 etc.
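To make the shape of this matrix concrete, here is a minimal sketch using scikit-learn's `CountVectorizer` on three made-up documents (the real data has 7600 records). The documents are hypothetical; only the class and its `fit_transform` method and `vocabulary_` attribute come from scikit-learn.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Three hypothetical records standing in for the news summaries.
docs = [
    "stocks rise as markets rally",
    "team wins the cup final",
    "markets fall as stocks slide",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse n x m matrix of word counts

print(X.shape)                       # (records, distinct words)
print(issparse(X))                   # stored as a SciPy sparse matrix
print(X[0])                          # (row, word index) -> count for record 0
print(vectorizer.vocabulary_["markets"])  # column index assigned to a word
```

Each printed entry of `X[0]` pairs a word index position with the number of times that word occurs in the record, which is the same layout as the partial output described above.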
Let’s fit the model. Note that we split the data into a training set with 80% of the records and a validation set with the remaining 20%.
Wow! 87% of our tests accurately predicted the text classification. That’s good.
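The split-and-fit step can be sketched as below. This assumes the multinomial variant of Naïve Bayes (`MultinomialNB`), which is a common choice for word counts but is not confirmed by the text, and it uses a synthetic corpus generated from hypothetical per-class vocabularies, so its accuracy says nothing about the 87% reported for the real data.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical per-class vocabularies used to generate a toy corpus.
VOCAB = {
    0: ["election", "minister", "treaty", "border", "summit", "protest"],
    1: ["team", "goal", "coach", "season", "match", "league"],
    2: ["shares", "profit", "market", "merger", "bank", "earnings"],
}

rng = random.Random(0)
texts, labels = [], []
for label, words in VOCAB.items():
    for _ in range(30):
        texts.append(" ".join(rng.choices(words, k=5)))
        labels.append(label)

# Vectorize all records, then hold out 20% of them for validation.
X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0
)

model = MultinomialNB()
model.fit(X_train, y_train)

# score() returns the fraction of validation records classified correctly.
accuracy = model.score(X_test, y_test)
print(f"validation accuracy: {accuracy:.2f}")
```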
We can look more deeply than the single accuracy score reported above. One way is to generate a confusion matrix as shown below. The confusion matrix shows, for each true class, the proportion of its records that the model assigned to each predicted class.
We see that Sports text almost always predicted its classification correctly (0.96). Business and Sci/Tech were a bit more blurred: When Business text was incorrectly predicted, it was usually against Sci/Tech and the converse for Sci/Tech. This all makes sense since Sports vocabulary is quite distinctive and Sci/Tech is often in the Business news.
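A row-normalized confusion matrix of this kind can be computed roughly as follows. The true and predicted labels here are invented for illustration; `confusion_matrix` is from `sklearn.metrics`, and the normalization is done by hand with NumPy so that each row sums to 1.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for three classes.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 0, 2]

# Rows are true classes, columns are predicted classes (raw counts).
cm = confusion_matrix(y_true, y_pred)

# Divide each row by its total so entries become proportions per true class.
cm_norm = cm / cm.sum(axis=1, keepdims=True)

print(cm)
print(cm_norm.round(2))
```

The diagonal of `cm_norm` holds the share of each true class that was predicted correctly; off-diagonal entries show which classes it was confused with.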
There are other views of model outcomes; see the sklearn.metrics API.
Now take a single news feed summary and see what the model predicts.
I have run many through the model and it performs quite well. The new text shown above gets a clear Business classification. When I run news summaries on cultural items (no category in the model), the predicted probabilities are low and spread across all categories, as expected.
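Scoring a single piece of new text can be sketched as below. The tiny training set, the class names and the `classify` helper are hypothetical, and `MultinomialNB` is again an assumption. In this toy setup the out-of-category text shares no words with the training data, so the model can only fall back on its class priors and the probabilities come out evenly spread.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

CLASS_NAMES = ["Sports", "Business", "Sci/Tech"]

# Hypothetical miniature training set, two records per class.
train_texts = [
    "team wins cup final",
    "coach praises team after match",
    "shares rise as bank reports profit",
    "merger lifts market shares",
    "new chip speeds up software",
    "scientists launch space probe",
]
train_labels = [0, 0, 1, 1, 2, 2]

# The pipeline vectorizes text and then applies the classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)


def classify(text):
    """Return a mapping of class name to predicted probability."""
    probs = model.predict_proba([text])[0]
    return dict(zip(CLASS_NAMES, probs))


in_scope = classify("bank profit beats market forecast")
out_of_scope = classify("gallery opens exhibition of medieval paintings")
print(in_scope)
print(out_of_scope)
```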
Let’s get the top 25 words and n-grams (phrases, in our case two words) among all training set text that were used to build the model. These are shown below.
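One way to rank single words and two-word phrases by frequency is sketched here. It relies on `CountVectorizer` with `ngram_range=(1, 2)` and sums the count matrix by column; the three documents are made up, and the top 5 rather than the top 25 are shown to keep the output short.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training text.
docs = [
    "the market fell on the news",
    "the team won the match",
    "the market rose on the news",
]

# ngram_range=(1, 2) counts single words and two-word phrases together.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# Total occurrences of each term across all records.
totals = np.asarray(X.sum(axis=0)).ravel()

# vocabulary_ maps term -> column index; pair each term with its total.
ranked = sorted(
    ((int(totals[index]), term) for term, index in vectorizer.vocabulary_.items()),
    reverse=True,
)
top = ranked[:5]
for count, term in top:
    print(f"{count:3d}  {term}")
```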
Hmm ... there are a lot of common meaningless words involved. Most of these are known as stopwords in natural language processing. Let’s remove the stop words and see if the model improves.
The above are stopwords from a file. The below allows you to iteratively add stopwords to the list as you explore the data.
Now we can see the top 25 words and n-grams after the stopwords are removed. Note how easy this is to do: we instantiate the CountVectorizer exactly as before, but this time pass it a reference to the stopword list.
This is a good example of how powerful the scikit-learn libraries are: you interact with the high-level APIs and the dirty work is done under the surface.
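The stopword handling described above might look roughly like this. The initial stopword set stands in for the list loaded from a file and is hypothetical, as are the documents; `stop_words` is a real `CountVectorizer` parameter that accepts a list of words to drop before counting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stopwords standing in for the list read from a file.
stopwords = {"the", "a", "of"}

# Iteratively add more stopwords as you explore the data.
stopwords.update(["on"])

docs = [
    "the market fell on the news",
    "the team won the match",
    "the market rose on the news",
]

# Same vectorizer as before, now with the stopword list passed in.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words=sorted(stopwords))
X = vectorizer.fit_transform(docs)

terms = sorted(vectorizer.vocabulary_)
print(terms)
```

Because stopwords are dropped before phrases are formed, two-word phrases such as "market fell" are built from the remaining neighbouring words.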
Now we train the same model for news feed summaries with no stop words (left) and for titles with no stop words (right).
Interesting ... the model using summaries with no stop words is as accurate as the one with them included in the text. Secondly, the titles model is less accurate than the summary model, but not by much (not bad for classifying text from samples of only 10-20 words).
I trained each of the models below 5 times: news feed text from summaries (no stops, with stops) and from titles (no stops, with stops). I averaged the accuracies and plotted them as shown below.
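Averaging accuracy over repeated training runs can be sketched as follows. The `mean_accuracy` helper, the synthetic corpus and the use of a different `random_state` per run are all assumptions made for illustration; the numbers it prints are unrelated to the results plotted in the article.

```python
import random
from statistics import mean

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical per-class vocabularies for a toy corpus.
VOCAB = {
    0: ["team", "goal", "coach", "season", "match", "league"],
    1: ["shares", "profit", "market", "merger", "bank", "earnings"],
    2: ["chip", "software", "probe", "network", "robot", "browser"],
}

rng = random.Random(1)
texts, labels = [], []
for label, words in VOCAB.items():
    for _ in range(30):
        w1, w2, w3 = rng.choices(words, k=3)
        texts.append(f"the {w1} of the {w2} {w3}")  # includes filler words
        labels.append(label)


def mean_accuracy(texts, labels, stop_words=None, runs=5):
    """Train on `runs` different 80/20 splits and average the accuracy."""
    X = CountVectorizer(stop_words=stop_words).fit_transform(texts)
    scores = []
    for seed in range(runs):
        X_train, X_test, y_train, y_test = train_test_split(
            X, labels, test_size=0.2, random_state=seed
        )
        model = MultinomialNB().fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    return mean(scores)


with_stops = mean_accuracy(texts, labels)
no_stops = mean_accuracy(texts, labels, stop_words=["the", "of"])
print(f"with stops: {with_stops:.3f}  no stops: {no_stops:.3f}")
```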
The Zeppelin notebook for this article: https://github.com/gregkeysquest/DataScience/tree/master/projects/naiveBayesTextClassifier/newsFeeds