Last updated on 3/4/22

Apply a Simple Bag-of-Words Approach

Understand the Meaning of Text Vectorization

Linguistics has attempted to derive universal rules behind languages. The difficulty is that there's always a turn of phrase, an idiom, or some new slang, that will create exceptions.

On the other hand, natural language processing leverages machine learning to parse, analyze, predict, classify, correct, translate, or generate written text.

But wait...machine learning models use numbers, not letters?

Exactly. So you need to turn the text into numbers first! This process is called vectorization.

Vectorization is the general process of turning a collection of text documents (a corpus) into numerical feature vectors fed to machine learning algorithms for modeling.

When you vectorize the corpus, you convert each word or token from the documents into an array of numbers. This array is the vector representation of the word.

A corpus, you say? o_O

In this chapter, you will discover the fundamentals of a simple yet efficient vectorization technique called bag-of-words and apply it to text classification. 

Understand the Bag-of-Words Approach

Bag-of-words (BOW) is a simple but powerful approach to vectorizing text.

As the name may suggest, the bag-of-words technique does not consider the position of a word in a document. The idea is to count the number of times each word appears in each of the documents. This approach may sound silly and crude since, without proper word order, humans would not understand the text at all. That's the beauty of the bag-of-words system: it is simple, but it works.
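To make the "position does not matter" idea concrete, here is a minimal sketch using only the Python standard library. The helper name bag_of_words and the two sentences are hypothetical, and splitting on whitespace is a deliberately naive stand-in for real tokenization.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count lowercase, whitespace-separated tokens, ignoring their order."""
    return Counter(sentence.lower().split())

# Two hypothetical sentences made of the same words in a different order
bow_a = bag_of_words("the dog chased the cat")
bow_b = bag_of_words("the cat chased the dog")

print(bow_a["the"])    # 2
print(bow_a == bow_b)  # True: the two bags are identical, word order is lost
```

Both sentences produce the same counts even though they mean different things, which is exactly the information a bag-of-words representation gives up.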

Consider the following three documents (sentences) from the well-known Surfin' Bird song, and count the number of times each word appears in each sentence.

[Figure: BOW on Surfin' Bird. A table counting the number of times each word appears in each of the three sentences.]

Each word is now associated with its own column of numbers, its own vector:

[Figure: Each word has its own vector. For example, bird -> (5, 1, 1): bird appears 5 times in sentence 1, once in sentence 2, and once in sentence 3. Likewise, the -> (2, 1, 2) and word -> (0, 0, 1).]

Note that the size of the document-term matrix is:

 number of documents × size of vocabulary

The vector size of each token equals the number of documents in the corpus. Large corpuses have long vectors, and small corpuses have short vectors (as in the Surfin' Bird example above, where each vector only has three numbers).
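The relationship between the matrix size and the token vectors can be sketched in plain Python. The three documents below are hypothetical (they are not the actual song lyrics), and token_vector is an illustrative helper, not a library function.

```python
# Hypothetical three-document corpus (not the actual song lyrics)
docs = [
    "the bird is the word",
    "the bird sings",
    "everybody knows the word",
]

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary token
matrix = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]

def token_vector(term):
    """Return the column of counts for one token across all documents."""
    column = vocabulary.index(term)
    return [row[column] for row in matrix]

print(len(matrix), len(vocabulary))  # 3 7 (documents, vocabulary size)
print(token_vector("the"))           # [2, 1, 1]
print(token_vector("word"))          # [1, 0, 1]
```

Each token vector has exactly three entries because the corpus has three documents; adding a fourth document would lengthen every vector by one.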

Create a Document-Term Matrix

The bag-of-words method is commonly used for text classification purposes where the frequency of each word is used as a feature for training a classifier. 

Consider the following sentences. The first two are about recipes, and the last three are about computing.

  • Take 2 cups of flour.

  • Mix the flour with the eggs.

  • Replace your keyboard in 2 minutes.

  • Do you prefer Windows or Mac?

  • The Mac book pro has such a noisy keyboard.

You can use the CountVectorizer from scikit-learn (you can read more on the official documentation page) to generate the document-term matrix for a slightly simplified version of these sentences with the following code:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = [
    '2 cups of flour',
    'replace the flour',
    'replace the keyboard in 2 minutes',
    'do you prefer Windows or Mac',
    'the Mac has the most noisy keyboard',
]
# Learn the vocabulary and build the document-term matrix in one step
X = vectorizer.fit_transform(corpus)

It returns the document-term matrix:

[Figure: Document-term matrix from CountVectorizer]

Each row corresponds to one of the sentences, and each column to a word in the corpus. For instance, the appears once in documents two and three and twice in document five, while the word flour appears once in documents one and two. The vocabulary is strongly related to the sentence topic: the word flour only appears in documents about recipes. On the other hand, the is less specific.
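If you want to check these counts yourself, one possible way is to look up a word's column through the fitted vectorizer's vocabulary_ attribute and read that column from the matrix. This is only a sketch; the variable names the_col, flour_col, and dense are illustrative. Note that with its default settings, CountVectorizer lowercases the text and ignores single-character tokens such as "2".

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    '2 cups of flour',
    'replace the flour',
    'replace the keyboard in 2 minutes',
    'do you prefer Windows or Mac',
    'the Mac has the most noisy keyboard',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps each token to its column index in the matrix
the_col = vectorizer.vocabulary_['the']
flour_col = vectorizer.vocabulary_['flour']

# Converting to a dense array is fine for a toy corpus only
dense = X.toarray()

print(X.shape)              # (5, 17): 5 documents, 17 distinct tokens
print(dense[:, the_col])    # [0 1 1 0 2]
print(dense[:, flour_col])  # [1 1 0 0 0]
```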

Handle a Large Matrix

The vocabulary is the set of all unique tokens in a corpus. Its size directly impacts the dimension of the document-term matrix. Reducing the size of the vocabulary is important to avoid performing calculations over gigantic matrices.

While removing stop words and lemmatizing helps reduce the size of the vocabulary significantly, it's often not enough.

Imagine that you are working on a corpus of 10,000 news articles with a total vocabulary of 10,000 tokens after lemmatization and stop word removal. The corresponding document-term matrix would have 10,000 rows and 10,000 columns, or 100 million cells. That's huge! Using such a matrix to train a classification model will lead to long training times and high memory consumption.
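A quick back-of-the-envelope calculation shows the scale. The sketch below assumes the matrix is stored densely with one 64-bit integer per cell; that storage choice is an assumption made for illustration, and a sparse format that stores only non-zero counts needs far less.

```python
n_documents = 10_000
vocabulary_size = 10_000
bytes_per_count = 8  # assumption: each count is stored as a 64-bit integer

n_cells = n_documents * vocabulary_size
dense_megabytes = n_cells * bytes_per_count / 1_000_000

print(n_cells)          # 100000000
print(dense_megabytes)  # 800.0
```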

Therefore, reducing the size of the vocabulary is crucial. The idea is to remove as many tokens as possible without throwing away relevant information. It's a delicate balance that is entirely dependent on the context. One strategy can be to filter out words that are either too frequent or too rare. Another strategy involves applying dimension reduction techniques, such as PCA, to the document-term matrix. For a quick reminder about how PCA works, check out the OpenClassrooms course called Perform an Exploratory Data Analysis.
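As a rough illustration of the first strategy, CountVectorizer accepts min_df and max_df parameters that drop tokens appearing in too few or too many documents. The thresholds below (at least 2 documents, at most 50% of documents) are arbitrary choices for this toy corpus, not recommended values.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    '2 cups of flour',
    'replace the flour',
    'replace the keyboard in 2 minutes',
    'do you prefer Windows or Mac',
    'the Mac has the most noisy keyboard',
]

# Baseline: keep every token
full = CountVectorizer().fit(corpus)

# Keep tokens found in at least 2 documents and in at most 50% of them
filtered = CountVectorizer(min_df=2, max_df=0.5).fit(corpus)

print(len(full.vocabulary_))         # 17
print(sorted(filtered.vocabulary_))  # ['flour', 'keyboard', 'mac', 'replace']
```

Here the rare tokens (seen in a single document) and the frequent token "the" (seen in three of five documents) are removed, while topic-specific words such as flour and keyboard survive.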

Build a Classifier Model Using Bag-of-Words

Now that we know how to create a document-term matrix, we can apply it to text classification!

To recap, the typical machine learning process to train a classifier broadly follows these steps:

  1. Feature extraction (vectorizing a corpus).

  2. Split the dataset into a training (70%) and a testing (30%) set to simulate the model's behavior on previously unseen data.

  3. Train the model on the training set. In scikit-learn, this comes down to calling the  fit()  method on the model.

  4. Evaluate the model's performance on the test set, scored by a metric such as accuracy, recall, or AUC (area under the ROC curve), or by inspecting the confusion matrix.
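To make the evaluation step more tangible, here is a small pure-Python sketch of how accuracy and recall follow from the four cells of a binary confusion matrix. The labels and predictions are made up for illustration, and confusion_counts is a hypothetical helper; in practice you would typically rely on the functions in sklearn.metrics.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical test-set labels and model predictions
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)  # share of correct predictions
recall = tp / (tp + fn)             # share of positives that were found

print(tp, fp, fn, tn)  # 3 1 1 3
print(accuracy)        # 0.75
print(recall)          # 0.75
```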

In this chapter, we'll focus on the feature extraction step, which, for NLP, equates to vectorizing a corpus. The goal is to demonstrate that it is possible to build a decent classifier model by counting the word occurrences in each document.

We will work on an excerpt of the classic Brown Corpus, the first million-word English electronic corpus created in 1961 at Brown University. This corpus contains text from 500 sources categorized by genre: news, editorial, romance, and humor, among others. To simplify things, we'll only consider the humor and science fiction categories.

The simplified dataset is available on the course GitHub repo and contains two columns: topic and text. Let's load and explore it.

Load the Dataset

Load the dataset into a pandas DataFrame:

import pandas as pd
df = pd.read_csv('brown_corpus_extract_humor_science_fiction.csv')
# Check the number of rows and columns
df.shape
> (2001, 2)

Now import spaCy and load a small English model:

import spacy
nlp = spacy.load("en_core_web_sm")

Preprocess the Data

The preprocessing step consists of the tasks you saw in Part 1: tokenizing, lemmatizing, and removing stop words and punctuation signs.

You can use spaCy to tokenize and lemmatize each text. Let's define a simple function that processes each text with the spaCy model and returns a list of lemmatized tokens.

def lemmatize(text):
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc]
    return tokens

Now, verify that the function works as expected.

text = "These are the good times, leave your cares behind."
lemmatize(text)

This returns:

['these', 'be', 'the', 'good', 'time', ',', 'leave', '-PRON-', 'care', 'behind', '.']

Seems correct, no?

But of course, the stop words and punctuation signs haven't been removed yet! spaCy makes stop word removal easy. It comes with a list of 326 predefined stop words and a token attribute  .is_stop  , which is True when the token is a stop word. You can modify the  lemmatize()  function to filter out the tokens that are:

  • Stop words, using  .is_stop  .

  • Punctuation signs, using .is_punct.

def lemmatize(text):
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if not (token.is_stop or token.is_punct)]
    return tokens

You can see that for the same sentence, "These are the good times, leave your cares behind," you now get the following tokens:

['good', 'time', 'leave', 'care']

Much cleaner!

One last thing. Although it is possible to have lists as elements of a pandas DataFrame, it is much easier to work with strings. You can use a scikit-learn vectorizer that works directly on the text and not on lists of tokens. Let's modify the function one last time to return a string of all the tokens, separated by spaces.

def lemmatize(text):
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if not (token.is_stop or token.is_punct)]
    return ' '.join(tokens)

That last version of  lemmatize()  applied to "These are the good times, leave your cares behind" now returns:

'good time leave care'

Now apply the lemmatize()  function to the whole corpus with:

df['processed_text'] = df.text.apply(lemmatize)

The notebook contains extra steps to remove rows with few tokens. As you will see, you end up with 738 humor and 520 science fiction texts. You can find the notebook on the course GitHub repo.

Vectorize the Data

Now that the text has been tokenized and lemmatized, and stop words and punctuation signs have been removed, the preprocessing phase is done. Next, we vectorize!

To vectorize the text, use one of scikit-learn's vectorizers. Scikit-learn has three types of vectorizers: CountVectorizer, TfidfVectorizer, and HashingVectorizer. The CountVectorizer counts the word occurrences in each document.

It takes but three lines of Python:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(df.processed_text)

X is now a sparse matrix of 1258 rows by 4749 columns. The number of rows corresponds to the number of documents, and the number of columns to the size of the vocabulary. We could use the vectorizer parameters  max_df  and  min_df  to filter out words that are too frequent or too rare.

Before training a model, the last step is to transform the topic column into class numbers. We arbitrarily choose 0 for humor and 1 for science fiction.

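# transform the topic from string to integer
df.loc[df.topic == 'humor', 'topic' ] = 0
df.loc[df.topic == 'science_fiction', 'topic' ] = 1
# define the target variable, cast to int so that scikit-learn
# treats it as class labels rather than as generic Python objects
y = df.topic.astype(int)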

As mentioned, for reasons beyond this course's scope, Naive Bayes classifiers perform well on text classification tasks. In scikit-learn, you can use the MultinomialNB model:

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
# 1. Declare the model
clf = MultinomialNB()
# 2. Train the model
clf.fit(X, y)
# 3. Make predictions
yhat = clf.predict(X)
# 4. score
print("Accuracy: ",accuracy_score(y, yhat))

The performance of that model is really good with an accuracy score of:

Accuracy: 0.9880763116057234

Lo and behold! 98.8% of the texts are correctly classified between humor and science fiction.

This is quite impressive, but mostly because we trained and scored the model on the whole corpus, so it had already seen every text it was asked to classify. In the companion notebook, we evaluate the model on unseen data and obtain 77% accuracy.
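One common way to get an honest estimate is to hold out part of the data before fitting anything. The sketch below is not the companion notebook's code: it uses a tiny made-up corpus in place of df.processed_text and df.topic, and the 30% test size and random_state value are arbitrary assumptions. The vectorizer is fitted on the training texts only, so the test set stays truly unseen.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for df.processed_text and df.topic
texts = [
    "flour egg sugar bake", "mix flour butter", "bake cake oven",
    "egg butter sugar", "whisk egg cream", "oven bake bread",
    "keyboard mouse screen", "laptop keyboard battery", "install software laptop",
    "screen pixel monitor", "mouse click software", "battery charge laptop",
]
labels = [0] * 6 + [1] * 6

# Hold out 30% of the documents, keeping the class balance in both sets
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

# Learn the vocabulary on the training set only, then reuse it on the test set
cv = CountVectorizer()
X_train = cv.fit_transform(X_train_txt)
X_test = cv.transform(X_test_txt)

clf = MultinomialNB()
clf.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, clf.predict(X_test))
print("Test accuracy:", test_accuracy)
```

Words that appear only in the test texts are simply ignored by cv.transform(), which mirrors what happens when the model meets new documents in practice.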

Let's Recap!

  • Vectorization is the general process of turning a collection of text documents, a corpus, into numerical feature vectors fed to machine learning algorithms for modeling.

  • Bag-of-words (BOW) is a simple but powerful approach to vectorizing text. As its name suggests, it does not consider the position of a word in the text.

  • Text classification is the main use-case of text vectorization using a bag-of-words approach.

  • The document-term matrix is used as input to a machine learning classifier.

  • Use spaCy to write simple, short functions that do all the necessary text processing tasks.

The next chapter will look into another prevalent vectorization technique called tf-idf, which stands for term frequency-inverse document frequency. It is still part of the bag-of-words umbrella but offers much higher efficiency and power! 
