As mentioned earlier, scikit-learn offers several types of vectorizers to create the document-term matrix, including count, tf-idf, and hash. In the previous chapter, we explored the CountVectorizer method. In this chapter, we will explore the tf-idf vectorization method, which is also used for text classification.
What Is TF-IDF?
The problem with counting word occurrences is that some words appear only in a limited number of documents. The model will learn that pattern, overfit the training data, and fail to generalize to new texts properly. Similarly, words that are present in all the documents will not bring any information to the classification model.
For this reason, it is sometimes better to normalize the word counts by the number of times they appear in the documents. This is the general idea behind the tf-idf vectorization.
Let’s look more closely at what tf-idf stands for:
Tf stands for term frequency, the number of times the word appears in each document. We did this in the previous chapter with CountVectorizer.
Idf stands for inverse document frequency, an inverse count of the number of documents a word appears in. Idf measures how significant a word is in the whole corpus.
If you multiply tf with idf, you get the tf-idf score:

$$\text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D)$$

Where:
$t$ is the word or token.
$d$ is the document.
$D$ is the set of documents in the corpus.
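To make the formula concrete, here is a minimal, hand-rolled computation on a toy two-document corpus (the documents and numbers below are only an illustration, not from the course; scikit-learn uses a slightly different, smoothed formula):
import math
# toy corpus: two already-tokenized documents
docs = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'sat', 'sat'],
]
def tf(term, doc):
    # term frequency: raw count of the term in the document
    return doc.count(term)
def idf(term, docs):
    # inverse document frequency: log(total documents / documents containing the term)
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)
def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)
# 'the' appears in every document, so its idf (and tf-idf) is 0
print(tf_idf('the', docs[0], docs))   # 0.0
# 'cat' appears in a single document, so it gets a higher weight
print(tf_idf('cat', docs[0], docs))   # 1 * log(2/1) ≈ 0.69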
Calculate TF-IDF
In scikit-learn, tf-idf is implemented as the TfidfVectorizer (you can read more in the scikit-learn documentation).
There are multiple ways to calculate the tf or the idf part of tf-idf depending on whether you want to maximize the impact of rare words or lower the role of frequent ones.
For instance, when the corpus is composed of documents of varying lengths, you can normalize by the size of each document:

$$\text{tf}(t, d) = \frac{f_{t,d}}{|d|}$$

Take the log:

$$\text{tf}(t, d) = \log(1 + f_{t,d})$$

Here, $f_{t,d}$ is the number of times the term $t$ appears in the document $d$, and $|d|$ is the number of terms in $d$.
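As a small illustration (reusing the toy document idea from above; the numbers are only an example), the two variants can be computed like this:
import math
doc = ['the', 'dog', 'sat', 'sat']
# length-normalized tf: raw count divided by the document length
tf_norm = doc.count('sat') / len(doc)        # 2 / 4 = 0.5
# log-scaled tf: dampens the effect of very frequent terms
tf_log = math.log(1 + doc.count('sat'))      # log(3) ≈ 1.10
# note: scikit-learn's sublinear_tf option uses 1 + log(tf) instead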
Similarly, you can calculate the idf term with different weighting strategies. TfidfVectorizer offers multiple variants of the tf-idf calculation through its parameters, such as sublinear_tf, smooth_idf, and norm, among others.
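For example, you could set these options explicitly when instantiating the vectorizer (the values below are only illustrative; by default sublinear_tf is False, smooth_idf is True, and norm is 'l2'):
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(
    sublinear_tf=True,   # replace tf with 1 + log(tf)
    smooth_idf=True,     # add 1 to document counts to avoid division by zero
    norm='l2',           # scale each document vector to unit length
)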
Apply TF-IDF on a Simple Corpus
Let’s apply tf-idf on the same corpus as before to see the difference.
We’ll use the same code as before, just replacing the CountVectorizer with the TfidfVectorizer.
corpus = [
'2 cups of flour',
'replace the flour',
'replace the keyboard in 2 minutes',
'do you prefer Windows or Mac',
'the Mac has the most noisy keyboard',
]
# import and instantiate the vectorizer
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
# apply the vectorizer to the corpus
X = vectorizer.fit_transform(corpus)
# display the document-term matrix
vocab = vectorizer.get_feature_names_out()
docterm = pd.DataFrame(X.todense(), columns=vocab)
This returns a new document-term matrix with floats instead of integers.
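To inspect the result, you can print the matrix and the idf weights learned for each term (a quick check, assuming the snippet above has just been run):
print(docterm.round(2))
# idf_ holds the learned inverse document frequency of each vocabulary term
for term, weight in zip(vocab, vectorizer.idf_):
    print(f'{term}: {weight:.2f}')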
Choose Between Count and TF-IDF
Choosing the CountVectorizer over the TfidfVectorizer depends on the nature of the documents you are working on. Tf-idf does not always bring better results than merely counting the occurrences of the words in each document.
Here are a couple of cases where CountVectorizer could perform better than TfidfVectorizer:
If words are distributed equally across the documents, then normalizing by idf will not matter much. You’ll see this in the example on the Brown Corpus. The documents use roughly the same vocabulary in all texts. Considering each word’s specificity across the corpus does not improve the model’s performance.
If rare words do not carry valuable meaning to the classification model, then tf-idf does not have a particular advantage. For example, a rare slang term in a social media comment often carries a general meaning rather than anything specific to the class you are predicting. A practical way to decide is to compare both vectorizers on your own data, as sketched below.
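The sketch below assumes you have lists called texts and labels (hypothetical names for your documents and their classes) and compares the two options with cross-validation:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# texts: list of raw documents, labels: list of classes (your own labeled data)
for vec in (CountVectorizer(), TfidfVectorizer()):
    pipe = make_pipeline(vec, LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, texts, labels, cv=5)
    print(type(vec).__name__, scores.mean())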
Identify the Limitations of TF-IDF
As you’ve seen before, the dimension of the word vector equals the number of documents in the corpus.
For example, if the corpus holds four documents, the vector’s dimension is four. For a corpus of 1000 documents, the vector dimension is 1000.
Words that are not in the corpus do not get a vector representation, meaning that the vocabulary size and its elements are also entirely dependent on the corpus at hand, as illustrated below.
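You can see both points directly on the small corpus from the earlier snippet (flour and laptop are just example words chosen here):
# a word's vector is a column of the document-term matrix:
# its dimension equals the number of documents (5 in this corpus)
print(docterm['flour'])
print(docterm.shape)              # (5, vocabulary size)
# a word absent from the corpus simply has no column at all
print('laptop' in vectorizer.get_feature_names_out())   # False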
We will explore numerical representations of words called embeddings (word2vec, GloVe, fastText) in Part 3 of the course. These representations are not tied to a particular corpus, which is an important distinction!
Let’s Recap!
A tf-idf score is a decimal number that measures the importance of a word in a given document, relative to the rest of the corpus.
In scikit-learn, tf-idf is implemented as the TfidfVectorizer. There are several ways to compute the score for each term and document. By concatenating the scores for each corpus document, you get a vector, and the dimension of the word vector equals the number of documents in the corpus.
A limitation of tf-idf is that it gives a numerical representation of words entirely dependent on the nature and number of documents considered. The same words will have different vectors in another context.
So far, we’ve looked at two vectorization techniques under the bag-of-words umbrella, and applied vectorization to text classification. In the next chapter, we will focus on sentiment analysis, another NLP classification problem widely used in the industry.