Last updated on 3/4/22
Preprocess Text Data
In this quiz, you will work with the text of The War of the Worlds by H. G. Wells, a science fiction classic.
This book is freely available from Project Gutenberg. Download the text with:
    import requests
    result = requests.get('http://www.gutenberg.org/files/36/36-0.txt')
    # Remove the header and footer: skip the first 840 characters,
    # then keep only the part before the "*** END" marker
    # (split returns a list, so take its first element)
    text = result.text[840:].split("*** END")[0]
    # Remove all non-ASCII characters
    text = text.encode('ascii', errors='ignore').decode('utf-8')
If you print the first 230 characters of the text, you should see the following quote:
But who shall dwell in these worlds if they be inhabited?
. . . Are we or they Lords of the World? . . . And
how are all things made for man?
KEPLER (quoted in _The Anatomy of Melancholy_)
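The trimming and ASCII filtering used in the download step can be tried on a small stand-in string before running them on the whole book. This is only a sketch: `raw`, `body`, and `clean` are illustrative names, and the fake header and footer are assumptions made up for the demonstration. Note that `split` returns a list, so the sketch takes element `[0]` to get the part before the end marker.

```python
# Stand-in for the downloaded file: a fake header, some text with
# curly quotes and an accented letter, then a fake footer.
raw = (
    "HEADER"
    "\u201cCaf\u00e9\u201d in the story\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK ***\n"
    "license text"
)

# Skip the fixed-length header, then keep what precedes the end marker.
body = raw[len("HEADER"):].split("*** END")[0]

# Encoding to ASCII with errors='ignore' silently drops every character
# that has no ASCII equivalent; decoding turns the bytes back into a str.
clean = body.encode("ascii", errors="ignore").decode("utf-8")

print(repr(clean))
```

The curly quotes and the accented letter disappear rather than being replaced, so words containing such characters come out shortened.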
You will also need to install NLTK and download the NLTK stop words with:
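The commands themselves are not reproduced on this page. A typical setup, assuming a standard `pip` environment and network access, might look like the following; treat it as a suggestion rather than the quiz's exact instructions. The variable name `english_stop_words` is illustrative.

```python
# Run once in a shell first (assumption: pip is available):
#     pip install nltk
import nltk

# Download the stop-word lists into the local NLTK data directory.
nltk.download('stopwords')

# Load the English list as a set for fast membership tests.
from nltk.corpus import stopwords
english_stop_words = set(stopwords.words('english'))
```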
Split the text along whitespaces, and lowercase the tokens.
How many distinct words do you have?
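A minimal sketch of this step, shown on a short stand-in string so that it runs without the download. The names `sample`, `tokens`, and `vocabulary` are illustrative; in the quiz, the downloaded `text` would take the place of `sample`.

```python
# Stand-in sample; in the quiz, use the downloaded `text` instead.
sample = "The Martians came. The war began, and the war ended."

# str.split() with no argument splits on any run of whitespace
# (spaces, tabs, newlines) and never yields empty strings.
tokens = [token.lower() for token in sample.split()]

# A set keeps one copy of each token, so its size is the
# number of distinct words.
vocabulary = set(tokens)

print(len(tokens), len(vocabulary))
```

With whitespace splitting, punctuation stays attached to the word before it, so `came.` and `came` would count as two different words.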
Define the following character tokenizer function:
    def chartokenizer(word):
        return [c for c in word]
Using this function, extract all the characters of a word or text into a list.
Apply this function to the War of the Worlds text. Which of the following assertions are true?
Hint: You can add lists with the + operator:
    ['c','a','t'] + ['d','o','g'] == ['c','a','t','d','o','g']
Careful, there are several correct answers.
There are 274732 total characters and 70 unique chars.
The text contains left and right square brackets: [ and ].
There are 274732 total characters and 44 unique lowercased chars.
The number 8 is missing from the text.
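One possible way to check assertions like these is sketched below on a short stand-in string. The names `sample`, `chars`, `unique_chars`, `unique_lower`, and `all_chars` are illustrative; replace `sample` with the downloaded `text` to answer the question.

```python
def chartokenizer(word):
    return [c for c in word]

# Stand-in sample; in the quiz, use the downloaded `text` instead.
sample = "The War of the Worlds [1898]"

# Tokenizing the whole string at once gives one flat list of
# characters, spaces included.
chars = chartokenizer(sample)

# Sets give the distinct characters, with and without lowercasing.
unique_chars = set(chars)
unique_lower = set(c.lower() for c in chars)

print(len(chars), len(unique_chars), len(unique_lower))

# Membership tests answer "does the text contain this character?"
print('[' in unique_chars, ']' in unique_chars, '8' in unique_chars)

# The same idea word by word, using the + hint to join the lists.
# Whitespace is lost here because split() removes it.
all_chars = []
for word in sample.split():
    all_chars = all_chars + chartokenizer(word)
```

Tokenizing the full string and tokenizing word by word give different totals, because the second route drops the whitespace characters.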
Look at the 20 most common tokens obtained with the NLTK WordPunctTokenizer:
    from nltk.tokenize import WordPunctTokenizer
    from collections import Counter
    tokens = WordPunctTokenizer().tokenize(text)
Which of the following assertions are true?
Careful, there are several correct answers.
The 20 most common words are stop words.
*the* and *The* appear more than 5000 times.
The 10 most common words make up more than 25% of all words.
WordPunctTokenizer discards punctuation signs from the list of tokens.
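The checks behind these assertions can be sketched with `Counter`. To keep the example runnable without NLTK, it uses `re.findall` with the pattern `\w+|[^\w\s]+` as a stand-in for the tokenizer; to my understanding this is the pattern WordPunctTokenizer is built on, but treat that as an assumption and use `WordPunctTokenizer().tokenize(text)` in the actual quiz. The names `sample`, `top`, `share`, `the_count`, and `punct` are illustrative.

```python
import re
from collections import Counter

# Stand-in sample; in the quiz, use the downloaded `text` instead.
sample = "The Martians came. The war began, and the war ended."

# Stand-in for WordPunctTokenizer().tokenize(sample): runs of word
# characters, or runs of non-word, non-space characters.
tokens = re.findall(r"\w+|[^\w\s]+", sample)

counts = Counter(tokens)

# most_common(k) returns the k most frequent (token, count) pairs.
top = counts.most_common(3)

# Share of all tokens covered by the top-k tokens.
share = sum(n for _, n in top) / len(tokens)

# Case-insensitive count of one word, here "the".
the_count = sum(n for tok, n in counts.items() if tok.lower() == "the")

# Tokens made only of punctuation, if the tokenizer kept any.
punct = [tok for tok in counts if re.fullmatch(r"[^\w\s]+", tok)]

print(top, round(share, 2), the_count, punct)
```

For the quiz, use `most_common(20)` and `most_common(10)` on the full token list. To test the stop-word assertion, compare each lowercased top token against NLTK's English stop-word list.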