Welcome!
Text Classification
Natural Language Processing covers several tasks, such as:
- Part of speech tagging
- Word segmentation
- Named entity recognition
- Machine translation
- Question answering
- Sentiment analysis
- Topic segmentation and recognition
- Natural language generation
One of them is classifying text based on its content. In this scenario you will learn how to use the Bag of Words and tf-idf models to perform this task.

Steps
Text Classification
Text classification starts with providing a training set, documents and their categories (labels), to the machine learning algorithm. After the model is trained it can be used to categorize new examples.
Text representation brings some complexity when framing the machine learning problem. Usually a dataset has the form of rows organized into features.
In our case every document is a data point and its label is the category, but what would the features be?
Read data
To start working with Python use the following command:
python
The first step in every text processing task is to read in the data. We'll be working with the Reuters dataset that is fortunately already imported and processed in the Python nltk library.
We've written the read_train_test function in the data_reader module to help you get started and focus on language processing rather than on the specifics of the dataset and data-reading techniques. As we will be dealing with a classification task, the function returns the train and test documents together with their corresponding labels. For now you will use only the documents, transforming them into a form that can be fed into the machine learning algorithm. At the end you will also use the labels.
import data_reader
train_documents, train_labels, test_documents, test_labels = data_reader.read_train_test()
Once the documents and labels are read, you can have a look at the data. The code here displays the number of examples in both the training and test datasets. Then we use set to get the unique values from the labels, which constitute the available categories.
print("Number of documents in the training set: ", len(train_documents))
print("Number of documents in the test set: ", len(test_documents))
categories = set(train_labels)
print("Categories (", len(categories), "): ", categories)
In the next command you can have a look at the content of the documents. Change the index by replacing the example_idx value and print the document and corresponding category.
You may notice that the documents still require some formatting, or that they still have HTML elements in their content.
example_idx = 75
print("Example document (category: ", train_labels[example_idx], "):")
print(train_documents[example_idx])
Bag of Words
Now it's time to start generating features for the documents we've read. First we will use the simple Bag of Words model, then move on to tf-idf.
Bag of Words is a model that requires building a vocabulary and then assigning the number of occurrences of each word in the document to the proper index. For example, if the vocabulary is:
[ also, and, both, football, games, john, like, likes, mary, movie, movies, titanic, to, too, watch ]
the sentence:
"John likes to watch movies. Mary likes to watch movies too."
will become the following vector:
[ 0, 0, 0, 0, 0, 1, 0, 2, 1, 0, 2, 0, 2, 1, 2 ]
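If you want to double-check this mapping yourself, a few lines of plain Python (a toy illustration, not part of the scenario code) reproduce the same counts:
# Hypothetical toy example: count vocabulary words in the sentence above
vocabulary = ["also", "and", "both", "football", "games", "john", "like",
              "likes", "mary", "movie", "movies", "titanic", "to", "too", "watch"]
sentence = "John likes to watch movies. Mary likes to watch movies too."
tokens = sentence.lower().replace(".", "").split()
print([tokens.count(word) for word in vocabulary])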
The Python sklearn library offers the CountVectorizer class to do this task for you. It takes a collection of texts, builds the vocabulary based on the words that appear in it and transforms the texts into count vectors. It makes preprocessing easier, as you don't have to split the text, count occurrences and build the vocabulary yourself. It also uses efficient data structures to represent the vectors, as they will be sparse.
The following function is a helper that takes a vectorizer object, fits it to the training documents and transforms them according to the built model. Additionally it prints the vocabulary (feature names). By default the vectorizer will extract every alphanumeric word with a length of at least 2.
from sklearn.feature_extraction.text import CountVectorizer
def fit_transform_vectorizer(vectorizer, print_features=True):
    # Build the vocabulary and transform the training documents into count vectors
    one_hot = vectorizer.fit_transform(train_documents)
    if print_features:
        print(vectorizer.get_feature_names())
    return one_hot
All we have to do now is to use the function and pass a vanilla CountVectorizer. Let's see what happens.
vectorizer = CountVectorizer()
one_hot = fit_transform_vectorizer(vectorizer)
Count Vector
Wow, this is a lot of words. Do you think all of them are needed for our model? Probably not. How about we limit them to the 500 most common ones?
Fortunately we don't have to do it by hand, as CountVectorizer has a max_features parameter in its constructor. We'll limit the vocabulary to 500. Feel free to change this.
vectorizer = CountVectorizer(max_features=500)
one_hot = fit_transform_vectorizer(vectorizer)
This is a clearer list, isn't it? But what strikes me immediately is all the numbers appearing at the beginning of the list. I would like to get rid of them, as I don't think their values would bring anything to the classification task. In fact, I would like to match only words that consist of letters and have at least 2 of them.
This can be achieved with regular expressions, and again CountVectorizer has a parameter (token_pattern) that handles it.
vectorizer = CountVectorizer(max_features=500, token_pattern='[a-zA-Z]{2,}')
one_hot = fit_transform_vectorizer(vectorizer)
Much tidier now, but there are still some problems I can see. One is that the vocabulary contains a lot of stop words, which are commonly appearing words that don't carry much meaning or context. These are words like:
a, an, and, both, do, have, how, is, it, I, more, much, my, on, one, so, the, this, to, too, very, what, who, where, you
... and many more. Every language has its own list, and different algorithms and teams use slightly different ones. Sometimes you may need to decide on your own list of stop words.
CountVectorizer has a built-in list for the English language. For other languages the stop_words parameter has to be set to a custom list.
vectorizer = CountVectorizer(max_features=500, token_pattern='[a-zA-Z]{2,}', stop_words='english')
one_hot = fit_transform_vectorizer(vectorizer)
Now the words seem to carry more meaning. The last issue is stemming. In a nutshell, it means reducing a word to its root form. You may have noticed that we have, for example, pairs like acquire - acquired, firm - firms or product - production appearing as separate features. We may want to merge these into their base forms.
There are several stemmers out there, so you don't have to write your own (it's actually quite a complex, although interesting, task). We will use PorterStemmer imported from the nltk library. This time our vectorizer construction has to change a little: we will first build the analyzer and then use it to initialize the final vectorizer.
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
analyzer = CountVectorizer(stop_words='english', token_pattern='[a-zA-Z]{2,}').build_analyzer()
vectorizer = CountVectorizer(max_features=500, analyzer=(lambda text: (stemmer.stem(word) for word in analyzer(text))))
one_hot = fit_transform_vectorizer(vectorizer)
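If you are curious what the stemmer actually does to individual words, a quick check like the one below (just an illustration using the stemmer object defined above) can be run in the same session. Note that the stems are not always dictionary words, since Porter stemming only strips suffixes.
# Peek at how PorterStemmer reduces related words to a common root
for word in ["acquire", "acquired", "firm", "firms", "product", "production"]:
    print(word, "->", stemmer.stem(word))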
Now we can see what this representation looks like and then how well it can be used to predict the category of a text. First let's have a peek at the result of the transformation and display one example. Feel free to change the index and check others.
print(one_hot.toarray())
example_idx = 73 # Feel free to change the index
print(one_hot.toarray()[example_idx])
Do you see now what I meant about sparsity? Most of the words don't appear in the sentence, and we also limited our vocabulary, so rare words and stop words won't be represented as features.
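To put a number on that sparsity, you can compare the non-zero entries with the total size of the matrix; here is a small sketch using the nnz and shape attributes of the scipy sparse matrix that CountVectorizer returns:
# Fraction of non-zero entries in the document-term matrix
n_rows, n_cols = one_hot.shape
print("Matrix shape: ", one_hot.shape)
print("Non-zero entries: ", one_hot.nnz)
print("Density: ", one_hot.nnz / (n_rows * n_cols))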
tf-idf
Another way to get the feature vector is to use the tf-idf model. It weights each word by how often it appears in a given document and by how rare it is across the whole dataset. The method of calculation can be represented as follows:
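In its standard textbook form (sklearn's TfidfVectorizer additionally applies smoothing and L2 normalisation, so its exact numbers differ slightly):
tf-idf(t, d) = tf(t, d) * idf(t), where idf(t) = log(N / df(t))
Here tf(t, d) is the number of times term t appears in document d, N is the total number of documents and df(t) is the number of documents containing t. A minimal sketch of the raw calculation on a made-up two-document corpus (purely illustrative, not part of the scenario code):
import math
# Hypothetical toy corpus to make the formula concrete
docs = [["john", "likes", "movies"], ["mary", "likes", "football"]]
term, doc = "movies", docs[0]
tf = doc.count(term)                    # how often the term appears in this document
df = sum(1 for d in docs if term in d)  # how many documents contain the term
idf = math.log(len(docs) / df)          # rarer terms get a higher weight
print("tf-idf of 'movies': ", tf * idf)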
sklearn again offers a class (TfidfVectorizer) to do the calculations for us. Building on what we've learned so far, we'll keep the same techniques and just change the type of vectorizer, assigning the transformation to a separate variable.
from sklearn.feature_extraction.text import TfidfVectorizer
analyzer = TfidfVectorizer(stop_words='english', token_pattern='[a-zA-Z]{2,}').build_analyzer()
vectorizer_tfidf = TfidfVectorizer(max_features=500, analyzer=(lambda text: (stemmer.stem(word) for word in analyzer(text))))
tfidf = fit_transform_vectorizer(vectorizer_tfidf)
The vocabulary looks exactly the same but the representation differs. Instead of counts it should now contain real values representing the tf-idf score. Let's have a look at the result of the transformation.
print(tfidf.toarray())
example_idx = 73
print(tfidf.toarray()[example_idx])
Classification model
We will now use these two representations (the count vector and tf-idf) to train two separate models. As the machine learning algorithm we will use multinomial Naive Bayes, which sklearn offers through the MultinomialNB class. Once fitted to the training data (which we created by transforming train_documents), a model can be used to make predictions on new examples.
from sklearn.naive_bayes import MultinomialNB
one_hot_model = MultinomialNB().fit(one_hot, train_labels)
tfidf_model = MultinomialNB().fit(tfidf, train_labels)
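As a quick sanity check before scoring the whole test set (just an illustration; the index 0 is arbitrary), you can already transform a single raw test document and ask the count-based model for its category:
# Transform one raw test document and predict its category
sample = vectorizer.transform([test_documents[0]])
print("Predicted: ", one_hot_model.predict(sample)[0])
print("Actual: ", test_labels[0])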
Let's check how both models perform on the training and test datasets. As we already have the vectorised representations for the training set, we need to do the same for the test documents. This time we will use the transform method of both vectorizers, as we no longer need to fit them.
Then we use the ground truth (train_labels and test_labels) to score both models.
test_one_hot = vectorizer.transform(test_documents)
test_tfidf = vectorizer_tfidf.transform(test_documents)
print("One hot train score: ", one_hot_model.score(one_hot, train_labels))
print("One hot test score: ", one_hot_model.score(test_one_hot, test_labels))
print("Tfidf train score: ", one_hot_model.score(tfidf, train_labels))
print("Tfidf test score: ", one_hot_model.score(test_tfidf, test_labels))
The results are not amazing, but they are reasonable considering we only did basic feature engineering and didn't really tune the algorithms. The models show good generalization, but to truly evaluate them other metrics should be used to account for the class imbalance.
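If you want to dig deeper, a sketch along the following lines (an addition to the scenario, using sklearn.metrics) gives per-class precision, recall and F1 as well as a macro-averaged F1, which treats every category equally regardless of its size and is therefore more informative under class imbalance:
from sklearn.metrics import classification_report, f1_score
# Per-class metrics for the tf-idf model on the test set
tfidf_predictions = tfidf_model.predict(test_tfidf)
print(classification_report(test_labels, tfidf_predictions))
# Macro-averaged F1 weights every category equally, regardless of its size
print("Macro F1: ", f1_score(test_labels, tfidf_predictions, average='macro'))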