Natural Language Processing covers many tasks, such as:
- Part of speech tagging
- Word segmentation
- Named entity recognition
- Machine translation
- Question answering
- Sentiment analysis
- Topic segmentation and recognition
- Natural language generation
One of them is classifying text based on its content. In this scenario you will learn how to build a Bag of Words model.

Steps
Introduction to Bag of Words
Read data
To start working with Python, use the following command:
python
In this scenario we will build a Bag of Words model based on the Reuters dataset, which is conveniently already imported and processed in the Python nltk library. We won't use the model further, as there are libraries that provide better support; you will have the chance to work with them in the next scenario.
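Before loading the corpus, it helps to see what a Bag of Words actually is: a representation that keeps only word counts and discards word order. Here is a minimal sketch in plain Python using the standard library's Counter (the function name and the lowercase/whitespace tokenization are illustrative choices, not part of the scenario's code):

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace -- a deliberately simple
    # tokenizer for illustration only.
    tokens = text.lower().split()
    # Counter maps each token to its number of occurrences;
    # word order is lost, which is the defining trait of Bag of Words.
    return Counter(tokens)

bow = bag_of_words("the cat sat on the mat")
print(bow["the"])  # 2
print(bow["cat"])  # 1
```

Note that "the cat sat on the mat" and "the mat sat on the cat" produce the same bag, which is exactly the information the model throws away.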
To read the corpus, simply use the read_documents function from the data_reader module.
import data_reader
documents = data_reader.read_documents()
You can have a look at the data and see the number of documents.
len(documents)
It's good to have a look at some examples. Feel free to change the index and do your own exploration.
example_idx = 1024
document = documents[example_idx]
document
As you can see, the document still contains some raw leftovers.
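One common way to deal with such leftovers is to filter the tokens before counting them. The sketch below keeps only purely alphabetic tokens; the filter pattern and the example tokens are assumptions for illustration, so adjust the pattern to whatever artifacts you actually see in your documents:

```python
import re

def clean_tokens(tokens):
    # Keep only tokens made entirely of letters; numbers, HTML
    # entities, and stray punctuation are dropped. This filter is
    # an assumption -- tune it to the artifacts in your corpus.
    return [t for t in tokens if re.fullmatch(r"[A-Za-z]+", t)]

raw = ["ACQ", "&lt;", "3.5", "mln", "dlrs", ">"]
print(clean_tokens(raw))  # ['ACQ', 'mln', 'dlrs']
```

Filtering this aggressively can also discard useful tokens (e.g. numbers in financial news), so treat it as a starting point rather than a fixed recipe.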