Last updated on 1/28/21

Vectorize Text Using Bag-of-Words Techniques

Evaluated skills

  • Vectorize Text for Classification Using Bag-of-Words


In this quiz, you will be working with tweets related to vegetables!

To do so, we collected over 20,000 (:o) tweets containing words like lettuce, broccoli, avocado, and cauliflower. The dataset is available in the course GitHub repo in CSV format.

You can load the dataset with the following code:

import pandas as pd
url = "https://raw.githubusercontent.com/alexisperrier/intro2nlp/master/data/openclassrooms_intro2nlp_sentiment_vegetables.csv"
df = pd.read_csv(url)
print(f"We have {df.shape[0]} tweets")

Next, load a few libraries, download stop words, and define a preprocessing function:

# a convenient module for punctuation signs
import string
punctuation_signs = [s for s in string.punctuation] + ['–','?','.','’']
# stopwords and tokenizer from NLTK
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
nltk_stopwords = stopwords.words("english")
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
# Define a preprocessing function that removes punctuation signs and stop words, and lowercases the tokens.
# It returns a string of tokens separated by spaces.
# Note: the stop-word test runs on the original casing, so capitalized stop words (e.g. "The") are kept.
def stopwords_punctuation(text):
    tokens = tokenizer.tokenize(text)
    tokens = [tk.lower() for tk in tokens if (tk not in punctuation_signs) and (tk not in nltk_stopwords)]
    return ' '.join(tokens)

Then apply the stopwords_punctuation function to the original tweets:

df['tokens'] = df.text.apply(stopwords_punctuation)
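
Before moving on, it can help to see what a bag-of-words representation looks like on a tiny example. The sketch below is not part of the course code: it hand-builds a count matrix with the standard library only, and the `toy_docs` corpus is a hypothetical stand-in for the `df['tokens']` column. In practice a library vectorizer would usually do this work.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the preprocessed df['tokens'] column
toy_docs = [
    "love avocado toast",
    "broccoli cauliflower soup",
    "avocado avocado salad",
]

# Vocabulary: the sorted unique tokens found across all documents
vocab = sorted({tk for doc in toy_docs for tk in doc.split()})

# Count matrix: one row per document, one column per vocabulary token
dtm = []
for doc in toy_docs:
    counts = Counter(doc.split())
    dtm.append([counts[tk] for tk in vocab])

n_rows, n_cols = len(dtm), len(vocab)
n_zeros = sum(row.count(0) for row in dtm)
print(f"{n_rows} rows x {n_cols} columns, {n_zeros} zero cells out of {n_rows * n_cols}")
```

Even with three short documents, most cells are zero because each document uses only a small part of the vocabulary.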

Now that all the preprocessing is done, let's dive in!

  • Question 1

    Open the dataset into a pandas DataFrame and start exploring. 
    As shown in the following script, there are 17,231 tweets, for a total of 245,621 tokens and 27,607 unique tokens.

    print(f"We have {df.shape[0]} tweets")
    # concatenate all the tokens
    all_tokens = []
    for tk in df.tokens.values:
        all_tokens += tk.split()
    print(f"We have {len(all_tokens)} tokens")
    # count unique tokens to know the vocab size
    print(f"We have {len(set(all_tokens))} unique tokens")

    What is the dimension of the document-term matrix? 

    • 245,621 rows; 17,231 columns

    • 27,607 rows; 245,621 columns

    • 17,231 rows; 27,607 columns

    • 17,231 rows; 245,621 columns

  • Question 2

    Why is the document-term matrix mostly filled with zeros?

    • Each cell holds the frequency of a token in a given document, with a zero when the token is absent. Since most words appear in only a few documents, most cells have the value 0.

    • When words are too rare, you set the value to zero to reduce the vocabulary size.

    • Documents have a lot of words, and zeros help reduce the weight of the matrix.

    • Multiplying the matrix by itself nullifies the zeros, which is why they are needed in the first place.

  • Question 3

    With tf-idf, you normalize each token count (tf: term frequency) by a measure of how frequent the token is across the documents (idf: inverse document frequency). Which assertions are true?

    Careful, there are several correct answers.
    • Terms with a higher tf-idf are more representative of a document because they are common in the document but rare in the collection. 

    • Terms with zeros across most documents can be considered stop words.

    • Terms with tf-idf around 0.5 carry the most information in a document classification task, while high tf-idf values are to be avoided.

    • Terms with low (but nonzero) tf-idf scores can be considered stop words.
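
To make the tf-idf weighting concrete, here is a minimal sketch that uses one common textbook variant: tf as the token count divided by the document length, and idf as log(N / document frequency). This is an assumption for illustration, not the course's implementation. Libraries such as scikit-learn apply smoothed variants, so their numbers will differ. The `toy_docs` corpus is hypothetical and deliberately keeps the word "the" to show how a token found in every document is scored.

```python
import math
from collections import Counter

# Hypothetical toy corpus; "the" is kept on purpose (no stop-word removal here)
toy_docs = [
    "the avocado toast",
    "the broccoli soup",
    "the avocado avocado salad",
]
tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each token appears
doc_freq = Counter(tk for tokens in tokenized for tk in set(tokens))

def tf_idf(token, tokens):
    """Unsmoothed tf-idf of `token` within one tokenized document."""
    tf = tokens.count(token) / len(tokens)        # relative frequency in the document
    idf = math.log(n_docs / doc_freq[token])      # rarer across the corpus means larger idf
    return tf * idf

# Scores for the third document
last_doc = tokenized[2]
scores = {tk: round(tf_idf(tk, last_doc), 3) for tk in set(last_doc)}
for tk, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tk:10s} {score}")
```

In this variant, a token present in every document has an idf of log(1) = 0, so its tf-idf is 0 regardless of how often it occurs in the document.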