Difficulty: beginner
Estimated Time: 20 minutes

Natural Language Processing (NLP) covers a range of tasks, such as:

  • Part of speech tagging
  • Word segmentation
  • Named entity recognition
  • Machine translation
  • Question answering
  • Sentiment analysis
  • Topic segmentation and recognition
  • Natural language generation

All of these, though, start with preparing text for further processing. In this lab you will learn how to use plain Python to clean and prepare text data.

Text Cleaning

Step 1 of 5

Read data

To start working with Python use the following command:


The first step in every text processing task is to read in the data. We'll be working with the Movie Reviews Corpus provided by the Python nltk library. You don't have to worry about this now as we've prepared the code to read the data for you.

We've written the read_reviews function in the data_reader module so that you can focus on language processing rather than on the specifics of the dataset and of data-reading techniques. The data will be read into the documents variable.

import data_reader
documents = data_reader.read_reviews()
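The data_reader module itself is not shown in this lab, so the following is only a rough sketch of what a function like read_reviews might do, not the lab's actual code. It assumes that each review is returned as one raw text string, and it relies on just two methods that nltk.corpus.movie_reviews provides, fileids() and raw(). The names TinyCorpus and read_reviews_sketch are hypothetical, and the tiny stand-in corpus with made-up text is there only so that the example runs without downloading anything.

```python
class TinyCorpus:
    """Hypothetical stand-in exposing the two corpus-reader methods used below."""

    def __init__(self, files):
        # Maps a file id (e.g. "neg/review_a.txt") to its raw text.
        self._files = files

    def fileids(self):
        # Return all file ids in a stable, sorted order.
        return sorted(self._files)

    def raw(self, fileid):
        # Return the unprocessed text of one file.
        return self._files[fileid]


def read_reviews_sketch(corpus):
    """Return a list with one raw text string per file in the corpus."""
    return [corpus.raw(fileid) for fileid in corpus.fileids()]


# Made-up miniature corpus; the real one would be nltk.corpus.movie_reviews.
corpus = TinyCorpus({
    "pos/review_b.txt": "a warm and funny film .",
    "neg/review_a.txt": "a dull and predictable film .",
})
documents = read_reviews_sketch(corpus)
print(documents)
```

Passing the corpus in as an argument keeps the sketch independent of how the data is stored, so the same function could in principle be handed the real NLTK corpus object once it has been downloaded.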

Once the documents are read in, you can have a look at the data. A good first check is the number of examples.
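Counting the examples is presumably a matter of calling Python's built-in len on the collection. The sketch below assumes that documents behaves like a list and uses a small made-up list in its place, so that it runs on its own.

```python
# Stand-in for the list returned by data_reader.read_reviews();
# the texts here are made up for illustration.
documents = [
    "a dull and predictable film .",
    "a warm and funny film .",
    "an uneven but ambitious film .",
]

# len() returns the number of items in the list, i.e. the number of reviews.
num_documents = len(documents)
print(num_documents)
```

Run against the list returned by data_reader.read_reviews(), the same len call reports the size of the full corpus.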


As you can see, we have 2000 reviews in the dataset. Pick an index number and see what one of them looks like.

example_idx = 75
document = documents[example_idx]
document
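A full review can be long, so it may be handier to look at its type, its length and only its first few characters. This is a minimal sketch, assuming each document is a plain string; the stand-in list and the preview_length name are made up for illustration.

```python
# Stand-in for the real documents list; the texts are made up.
documents = [
    "first stand-in review .",
    "second stand-in review , a little longer than the first one .",
]

example_idx = 1
document = documents[example_idx]

# Slicing a string returns a new string with at most preview_length characters.
preview_length = 20
preview = document[:preview_length]

print(type(document).__name__)  # the kind of object a document is
print(len(document))            # number of characters in the review
print(preview)                  # the beginning of the review
```

If the documents turn out to be lists of tokens rather than strings, the same slice would return the first 20 tokens instead of the first 20 characters.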