10 hours
- Hard
Last updated on 3/4/22
Preprocess Text Data
Evaluated skills
- Preprocess Text Data
Description
In this quiz, you're going to work on the text of The War of the Worlds by H. G. Wells, a science fiction classic.
This book is freely available from Project Gutenberg. Download the text with:
import requests
result = requests.get('http://www.gutenberg.org/files/36/36-0.txt')
# Remove the Project Gutenberg header and footer
text = result.text[840:].split("*** END")[0]
# Remove all non-ASCII characters
text = text.encode('ascii', errors='ignore').decode('utf-8')
If you print the first 230 characters of the text, you should see the following quote:
print(text[:230])
But who shall dwell in these worlds if they be inhabited?
. . . Are we or they Lords of the World? . . . And
how are all things made for man?
KEPLER (quoted in _The Anatomy of Melancholy_)
You will also need to install NLTK and download the NLTK stop words with:
import nltk
nltk.download('stopwords')
Question 1
Split the text along whitespaces, and lowercase the tokens.
How many distinct words do you have?
10518
10517
10051
10052
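The mechanics of the question can be sketched like this, with a short hypothetical sample standing in for the full book text (run the same steps on the downloaded text to get your answer):

```python
# Hypothetical sample; in the quiz, use the full text downloaded above
sample = "But who shall dwell in these worlds if they be inhabited?"

# Split along whitespace and lowercase each token
tokens = [w.lower() for w in sample.split()]

# Distinct words are the unique tokens
distinct = set(tokens)
print(len(distinct))  # → 11
```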
Question 2
Define the following character tokenizer function:
def chartokenizer(word):
    return [c for c in word]
Using this function, extract all the characters of a word or text into a list. Apply this function to the War of the Worlds text. Which of the following assertions are true?
Hint: You can add lists with the + operator: ['c','a','t'] + ['d','o','g'] == ['c','a','t','d','o','g']
Careful, there are several correct answers.
There are 274732 total characters and 70 unique chars.
The text contains left and right square brackets: [].
There are 274732 total characters and 44 unique lowercased chars.
The number 8 is missing from the text.
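Applying chartokenizer across many words follows the list-addition hint above. A sketch on a hypothetical sample (run it on the full text for the quiz):

```python
from collections import Counter

def chartokenizer(word):
    return [c for c in word]

# Hypothetical sample standing in for the book text
sample = "cat dog"

chars = []
for word in sample.split():
    chars = chars + chartokenizer(word)  # list addition, as in the hint

# Counter tallies how often each character occurs
counts = Counter(chars)
print(len(chars), len(set(chars)))  # total chars vs. unique chars
```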
Question 3
Look at the 20 most common tokens obtained with the NLTK WordPunctTokenizer. Which of the following assertions are true?
from nltk.tokenize import WordPunctTokenizer
from collections import Counter

tokens = WordPunctTokenizer().tokenize(text)

Careful, there are several correct answers.
The 20 most common words are stop words.
*the* and *The* appear more than 5000 times.
The 10 most common words make up more than 25% of all words.
WordPunctTokenizer discards punctuation signs from the list of tokens.
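Counter.most_common is the natural way to inspect the top tokens. A sketch on a short hypothetical sentence (substitute the full book text when answering):

```python
from collections import Counter
from nltk.tokenize import WordPunctTokenizer

# Hypothetical sample; use the downloaded text in the quiz itself
sample = "The Martians came, and the Martians left."

tokens = WordPunctTokenizer().tokenize(sample)
counts = Counter(tokens)
print(counts.most_common(3))
```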