Natural Language Processing covers a number of tasks, such as:
- Part-of-speech tagging
- Word segmentation
- Named entity recognition
- Machine translation
- Question answering
- Sentiment analysis
- Topic segmentation and recognition
- Natural language generation
One of them is classifying text based on its content. In this scenario you will learn how to build a Bag of Words model.
Introduction to Bag of Words
To start working with Python, use the following command:
In this scenario we will build a Bag of Words model based on the Reuters dataset, which is conveniently already imported and preprocessed in the Python nltk library. We won't use the model further, as there are libraries that provide better support; you will have the chance to work with them in the next scenario.
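As a quick illustration of the idea (a minimal sketch using only the Python standard library, not the scenario's code), a bag of words represents a text by its word counts, discarding word order:

```python
from collections import Counter

# A bag of words keeps only how often each word occurs, not where it occurs.
text = "the cat sat on the mat"
bag = Counter(text.split())
print(bag["the"])  # "the" appears twice in the sentence
```

The same counting step, applied per document over a shared vocabulary, is what turns a corpus into the numeric vectors a classifier can use.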
To read the corpus, simply use the `read_documents` function from the `data_reader` module.
documents = data_reader.read_documents()
You can have a look at the data and see the number of documents.
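For instance, assuming `documents` is a plain Python list as returned by `data_reader.read_documents()` (the toy list below is only a stand-in for the real corpus):

```python
# Stand-in for the corpus loaded above; the real list comes from data_reader.
documents = [
    "example reuters story one",
    "example reuters story two",
]
print(len(documents))  # number of documents in the corpus
```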
It's good to have a look at some examples. Feel free to change the index and do your own exploration.
example_idx = 1024
document = documents[example_idx]
document
As you can see, the document still contains some raw leftovers.
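One simple way to strip such leftovers (a hedged sketch, not necessarily the scenario's cleaning step; the regex and the sample string are assumptions) is lowercasing plus punctuation removal:

```python
import re

def clean(text):
    # Lowercase, then replace anything that is not a letter, digit,
    # or whitespace with a space.
    text = text.lower()
    return re.sub(r"[^a-z0-9\s]", " ", text)

raw = "ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT\n  Mounting trade friction"
print(clean(raw))
```

More thorough preprocessing (tokenization, stop-word removal, stemming) is usually delegated to a library, which is exactly why the next scenario moves to tools with better support.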