As mentioned earlier, scikit-learn offers several types of vectorizers to create the document-term matrix, including count, tf-idf, and hash. In the previous chapter, we explored the CountVectorizer method. In this chapter, we will explore tf-idf, another widespread vectorization method used for text classification.
What Is TF-IDF?
The problem with counting word occurrences is that some words appear only in a limited number of documents. The model will learn that pattern, overfit the training data, and fail to generalize to new texts properly. Similarly, words that are present in all the documents will not bring any information to the classification model.
For this reason, it is sometimes better to normalize the word counts by the number of documents the words appear in. This is the general idea behind tf-idf vectorization.
Let's look more closely at what tf-idf stands for:
Tf stands for term frequency, the number of times the word appears in each document. We did this in the previous chapter with CountVectorizer.
Idf stands for inverse document frequency, an inverse count of the number of documents a word appears in. Idf measures how significant a word is in the whole corpus.
If you multiply tf by idf, you get the tf-idf score:
$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$
where:
t is the word or token.
d is the document.
D is the set of documents in the corpus.
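To make the product of tf and idf concrete, here is a minimal sketch in plain Python. It assumes the simplest variants (raw counts for tf, and log(N / df) for idf), which is not exactly what scikit-learn computes by default. The toy corpus and the function names are hypothetical and only serve the illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
corpus = [
    "the mac has a noisy keyboard",
    "the windows laptop has a quiet keyboard",
    "the mac is quiet",
]
docs = [doc.split() for doc in corpus]


def tf(term, doc_tokens):
    """Raw term frequency: number of times `term` occurs in the document."""
    return Counter(doc_tokens)[term]


def idf(term, all_docs):
    """Plain inverse document frequency, log(N / df).

    Assumes the term occurs in at least one document (df > 0).
    """
    df = sum(1 for tokens in all_docs if term in tokens)
    return math.log(len(all_docs) / df)


def tf_idf(term, doc_tokens, all_docs):
    """tf-idf score of `term` in one document of the corpus."""
    return tf(term, doc_tokens) * idf(term, all_docs)


# "the" occurs in every document, so its idf (and its tf-idf) is zero
print(tf_idf("the", docs[0], docs))
# "noisy" occurs in a single document, so it gets the highest idf
print(tf_idf("noisy", docs[0], docs))
```

With these definitions, a word present in all documents scores zero, while a word specific to one document scores highest.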
In scikit-learn, tf-idf is implemented as the TfidfVectorizer (you can read more in the scikit-learn documentation).
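As a quick illustration, the sketch below fits a TfidfVectorizer with its default settings on a toy corpus (hypothetical, not from the course data). Note that the default tokenizer ignores single-character tokens such as "a", and that each row is L2-normalized by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, for illustration only)
corpus = [
    "the mac has a noisy keyboard",
    "the windows laptop has a quiet keyboard",
    "the mac is quiet",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix

# One row per document, one column per token kept in the vocabulary
print(X.shape)
# Mapping from token to column index
print(sorted(vectorizer.vocabulary_))
# Dense view of the tf-idf scores, rounded for readability
print(X.toarray().round(2))
# With the default norm="l2", every row has unit length
print(np.linalg.norm(X.toarray(), axis=1))
```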
There are multiple ways to calculate the tf or the idf part of tf-idf depending on whether you want to maximize the impact of rare words or lower the role of frequent ones.
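For instance, when the corpus is composed of documents of varying length, you can normalize by the size of each document:
$$\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}$$
Or take the log, for example:
$$\mathrm{tf}(t, d) = \log(1 + n_{t,d})$$
Here, $n_{t,d}$ is the number of times the term t appears in the document d.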
Similarly, the idf term can be calculated with different weighting strategies. TfidfVectorizer offers multiple variants of the tf-idf calculation through parameters such as norm, among others.
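The sketch below shows how a few of these parameters change the output. Besides norm, it uses use_idf and sublinear_tf, which are existing TfidfVectorizer parameters not named in the text above. The corpus is a hypothetical toy example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical, for illustration only)
corpus = [
    "you need a windows license to install windows on a mac",
    "the mac has a noisy keyboard",
    "do you prefer windows or mac",
]

# use_idf=False and norm=None: the output reduces to raw term counts
raw_tf = TfidfVectorizer(use_idf=False, norm=None).fit_transform(corpus)
counts = CountVectorizer().fit_transform(corpus)

# sublinear_tf=True replaces each nonzero count tf with 1 + log(tf)
sublinear = TfidfVectorizer(use_idf=False, norm=None, sublinear_tf=True)
X_sub = sublinear.fit_transform(corpus)

# Defaults: idf weighting, then L2-normalization of each document row
X_default = TfidfVectorizer().fit_transform(corpus)

col = sublinear.vocabulary_["windows"]  # "windows" occurs twice in document 0
print(raw_tf[0, col], X_sub[0, col])
print(np.linalg.norm(X_default.toarray(), axis=1))
```

Turning off both idf weighting and normalization is a handy way to check that the vectorizer then behaves like CountVectorizer.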
Choose Between Count and TF-IDF
Choosing CountVectorizer over TfidfVectorizer depends on the nature of the documents you are working on. Tf-idf does not always bring better results than merely counting the occurrences of the words in each document!
Here are a couple of cases where tf could perform better than tf-idf:
If words are distributed equally across the documents, then normalizing by idf will not matter much. You'll see this in the example on the Brown corpus. The documents use roughly the same vocabulary in all texts. As such, taking into account each word's specificity across the corpus does not improve the model's performance.
If rare words do not carry valuable meaning for the classification model, then tf-idf does not have a particular advantage. For example, when someone uses a slang word that means something general in a comment on social media.
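In practice, the simplest way to decide is to try both vectorizers with the same classifier and compare cross-validated scores. The sketch below does this on a tiny hypothetical dataset. The texts, the labels, and the choice of MultinomialNB are assumptions for illustration, and scores on such a small sample say nothing about real corpora.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny labeled dataset (hypothetical, for illustration only)
texts = [
    "the mac keyboard is noisy",
    "my mac keyboard stopped working",
    "replace the noisy mac keyboard",
    "the laptop keyboard feels cheap",
    "install windows on a mac",
    "you need a windows license",
    "windows update failed to install",
    "how to install a license on windows",
]
labels = ["hardware"] * 4 + ["software"] * 4

scores = {}
for name, vectorizer in [("count", CountVectorizer()), ("tf-idf", TfidfVectorizer())]:
    # Same classifier for both runs, so only the vectorization differs
    model = make_pipeline(vectorizer, MultinomialNB())
    scores[name] = cross_val_score(model, texts, labels, cv=2).mean()

print(scores)
```

Wrapping the vectorizer in the pipeline ensures the vocabulary and idf weights are fitted on the training folds only.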
Identify the Limitations of TF-IDF
By concatenating a word's tf-idf scores across all the documents in the corpus, you get a vector for that word. The dimension of the word vector equals the number of documents in the corpus.
For example, if the corpus holds four documents, the vector's dimension is 4. For a corpus of 1000 documents, the vector dimension is 1000.
Words that are not in the corpus do not get a vector representation, meaning that the vocabulary size and elements are also entirely dependent on the corpus at play.
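Both limitations can be checked with a short sketch, assuming a hypothetical four-document corpus. The word vector is read as a column of the document-term matrix, and "linux" stands in for any word absent from the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Four-document corpus (hypothetical, for illustration only)
corpus = [
    "ways to replace the noisy mac keyboard",
    "do you prefer windows or mac",
    "the mac has a noisy keyboard",
    "ways to install windows on a mac",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus).toarray()

# A word's vector is its column: one tf-idf score per document
mac_vector = X[:, vectorizer.vocabulary_["mac"]]
print(mac_vector.shape)  # one dimension per document in the corpus

# A word that never occurs in the corpus has no column at all
print("linux" in vectorizer.vocabulary_)
# Transforming a text made only of unseen words gives an all-zero row
print(vectorizer.transform(["linux"]).nnz)
```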
We will explore numerical representations of words called embeddings (Word2vec, GloVe, fastText) in Part 3 of the course. These representations are learned once on large external corpora, so they do not depend on the corpus you are working on, which is an important distinction!
Take It Further: Term-Term Matrix
The document-term matrix based on term frequency or tf-idf does not consider the context of a word. Instead of counting the word frequencies across documents, you can count how often words occur near each other and capture this contextual information in a co-occurrence matrix.
For context, use a window around the word (e.g., three words to the left and three to the right), in which case each cell represents the number of times one word occurs within a (±3) window around the other.
Consider, for instance, this small corpus:
sentences = ['ways to replace the noisy Mac keyboard','do you prefer Windows or Mac','the Mac has a noisy keyboard','ways to install Windows on a Mac','you need a Windows license to install Windows on a Mac']
If you take a window of three tokens left and right, you end up with the following term-term co-occurrence matrix:
The tokens windows and mac appear three times close by, and the token noisy appears twice close to mac and keyboard but never close to windows.
There are many ways to implement a word-word co-occurrence matrix. The idea is to loop over each text in the corpus and, for each word, count the words that fall in its surrounding window. You can find some examples in Python on this Stack Overflow page.
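One possible implementation is sketched below in plain Python. It assumes lowercasing, whitespace tokenization, and no stop word removal; the function name cooccurrence is hypothetical. The counts are stored in a dictionary keyed by (word, neighbor) pairs rather than in a dense matrix.

```python
from collections import Counter

sentences = [
    "ways to replace the noisy Mac keyboard",
    "do you prefer Windows or Mac",
    "the Mac has a noisy keyboard",
    "ways to install Windows on a Mac",
    "you need a Windows license to install Windows on a Mac",
]


def cooccurrence(texts, window=3):
    """Count how often each pair of tokens occurs within `window` positions."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for i, word in enumerate(tokens):
            # Window boundaries, clipped to the sentence limits
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


counts = cooccurrence(sentences, window=3)
print(counts[("windows", "mac")])     # 3
print(counts[("noisy", "mac")])       # 2
print(counts[("noisy", "keyboard")])  # 2
print(counts[("noisy", "windows")])   # 0
```

The printed counts match the observations above: windows and mac co-occur three times, while noisy never appears near windows.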
A tf-idf score is a decimal number that measures the importance of a word in any document.
In scikit-learn, tf-idf is implemented as the TfidfVectorizer. There are several ways to compute the score for each term and document.
By concatenating the scores for each corpus document, you get a vector, and the dimension of the word vector equals the number of documents in the corpus.
A limitation of tf-idf is that it gives a numerical representation of words that depends entirely on the nature and number of documents considered. The same words will have different vectors in another corpus.
So far, we've looked at two vectorization techniques under the bag-of-words umbrella, and applied vectorization to text classification. In the next chapter, we will focus on sentiment analysis, another NLP classification problem widely used in the industry.