• 10 hours
  • Hard

Last updated on 3/4/22

Preprocess Text Data

Evaluated skills

  • Preprocess Text Data


In this quiz, you're going to work on the text of The War of the Worlds by H. G. Wells, a science-fiction classic.

This book is freely available from Project Gutenberg. Download the text with:

import requests
result = requests.get('http://www.gutenberg.org/files/36/36-0.txt')
# Remove the Project Gutenberg header and footer
text = result.text[840:].split("*** END")[0]
# Drop all the non-ASCII characters
text = text.encode('ascii', errors='ignore').decode('utf-8')

If you print the first 230 characters of the text, you should see the following quote:
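A quick check, using the text variable from the snippet above:

print(text[:230])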


But who shall dwell in these worlds if they be inhabited?
    . . . Are we or they Lords of the World? . . . And
    how are all things made for man?
                    KEPLER (quoted in _The Anatomy of Melancholy_)

You will also need to install NLTK and download the NLTK stop words with:

import nltk
nltk.download('stopwords')

  • Question 1

    Split the text on whitespace and lowercase the tokens (a minimal sketch follows the options below).

    How many distinct words do you have?

    • 10518

    • 10517

    • 10051

    • 10052
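
    One way to check, as a minimal sketch, assuming text is the cleaned string from the download step:

    tokens = text.lower().split()
    print(len(set(tokens)))  # number of distinct lowercased tokens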

  • Question 2

    Define the following character tokenizer function:

    def chartokenizer(word):
        return [c for c in word]

    Using this function, extract all the characters of a word or text into a list.

    Apply this function to the War of the Worlds text (a sketch follows the options below). Which of the following assertions are true?

    Hint: You can add lists with the + operator: ['c','a','t'] + ['d','o','g'] == ['c','a','t','d','o','g']

    Careful, there are several correct answers.
    • There are 274732 total characters and 70 unique chars.

    • The text contains left and right square brackets: [].

    • There are 274732 total characters and 44 unique lowercased chars.

    • The number 8 is missing from the text.
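
    A minimal sketch of one approach, assuming the per-word character lists are concatenated with + as the hint suggests:

    chars = []
    for word in text.split():
        chars = chars + chartokenizer(word)
    print(len(chars))                          # total number of characters
    print(len(set(chars)))                     # number of unique characters
    print(len(set(c.lower() for c in chars)))  # unique lowercased characters

    In practice, chars += chartokenizer(word) avoids copying the whole list on every iteration.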

  • Question 3

    Look at the 20 most common tokens obtained with the NLTK WordPunctTokenizer.

    Which of the following assertions are true? (A sketch for checking the counts follows the options below.)

    from nltk.tokenize import WordPunctTokenizer
    from collections import Counter
    tokens = WordPunctTokenizer().tokenize(text)
    # Inspect the 20 most common tokens
    print(Counter(tokens).most_common(20))
    Careful, there are several correct answers.
    • The 20 most common words are stop words.

    • *the* and *The* appear more than 5000 times.

    • The 10 most common words make up more than 25% of all words.

    • WordPunctTokenizer discards punctuation marks from the list of tokens.
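
    A minimal sketch for checking the frequency assertions, reusing the tokens list from the snippet above:

    counts = Counter(tokens)
    top_10 = counts.most_common(10)
    print(top_10)
    print(sum(n for _, n in top_10) / len(tokens))  # share of the 10 most common tokens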