In Part 2, you learned that text vectorization is vital for any subsequent classification task. You also learned that the bag-of-words approach, with tf-idf, is a simple and quite efficient method, but it has several shortcomings:
Context and meaning are lost.
The document-term matrix is large and sparse.
Vectorization is relative to the corpus (similar words will have different vectors on another corpus).
In Part 3, we will look at another text vectorization system called word embeddings that will help overcome these shortcomings!
Identify the Benefits of Word Embeddings
In 2013, a new text vectorizing method called embeddings took NLP by storm. An embedding technique called Word2vec was born, soon to be followed by GloVe and fastText.
These new text vectorization techniques solved the inherent shortcomings of tf-idf. They also managed to retain semantic similarity between words, meaning that words with related meanings end up with vectors that are close to one another. Let's explore further:
1. They Retain Semantic Similarity
As mentioned, one of the most remarkable properties of embeddings is their ability to capture the semantic relationship between words. For example:
A hammer and pliers are both tools. Since they are related or similar in meaning, their vectors will be near one another. The same holds for the words apple and pear, or truck and vehicle.
When visualizing word vectors in a 2D space, similar words are grouped in the same regions. The figure below shows the five words most similar to each of the following: Paris, London, Moscow, Twitter, Facebook, pizza, fish, train, and car, according to Word2vec embeddings.
As you can see, similar words keep their semantic distance! Truly amazing!
With embeddings, it also becomes possible to capture analogies between words. For example, a woman is to a queen what a man is to a king; Paris is to France what Berlin is to Germany. You can also add and subtract words.
In this case, the distance between the respective vectors for woman and queen is close to the distance between the vectors for man and king.
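The add-and-subtract idea can be sketched with a few hand-made vectors. The 2D values below are invented for illustration (they are not real Word2vec vectors), and the helper names cosine and analogy are hypothetical. With a pre-trained model loaded in Gensim, the equivalent query would be model.most_similar(positive=['king', 'woman'], negative=['man']).

```python
import math

# Toy 2D vectors, invented for illustration only (not real Word2vec values).
# The first axis loosely encodes "royalty", the second loosely encodes "gender".
toy_vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.2, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word d such that a is to b what c is to d."""
    # Compute b - a + c element by element.
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three input words, then pick the closest remaining word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "king", "woman", toy_vectors))  # prints "queen"
```

Excluding the input words from the candidates matters: without that step, the closest vector to the result is often one of the words you started from.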
But semantic similarity is not the only advantage of embeddings. Let's take a look at some more.
2. They Have Dense Vectors
Word embeddings are dense vectors, meaning that all values are non-zero (except for the occasional element). Therefore, more information is given to the classification or clustering model, which often leads to better performance.
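To make the contrast with bag-of-words concrete, here is a small sketch that compares the share of non-zero values in a word-count vector and in a dense vector. The vocabulary, the sentence, and the dense values are all made up for illustration, and nonzero_share is a hypothetical helper name.

```python
# Toy vocabulary and sentence, invented for illustration.
vocabulary = ["apple", "banana", "book", "cat", "dog",
              "hammer", "pear", "pliers", "read", "truck"]
sentence = "the cat chased the dog".split()

# Bag-of-words count vector: one slot per vocabulary word, mostly zeros.
sparse_vector = [sentence.count(word) for word in vocabulary]

# Made-up dense values standing in for an embedding of the same length.
dense_vector = [0.21, -0.13, 0.05, 0.32, -0.27, 0.08, -0.02, 0.17, 0.4, -0.09]

def nonzero_share(vector):
    """Fraction of elements that are not zero."""
    return sum(1 for value in vector if value != 0) / len(vector)

print(sparse_vector)                 # [0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
print(nonzero_share(sparse_vector))  # 0.2
print(nonzero_share(dense_vector))   # 1.0
```

With a realistic vocabulary of tens of thousands of words, the share of non-zero values in the count vector would be far smaller than in this toy case.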
3. They Have a Constant Vector Size
With word embeddings, the vector size is no longer dependent on the number of documents in your corpus!
When training embedding models, the dimension of the word vector is a parameter of the model. You decide beforehand what vector size you need to represent each word. Pre-trained embeddings usually come in dimensions 50, 100, and 300.
4. Their Vector Representations are Absolute
Word embeddings are trained on gigantic datasets. Word2vec, for instance, was trained on a Google News dataset of 100 billion words, GloVe on a dataset of 6 billion words, and fastText on 16 billion tokens. As a direct consequence, these models have very large vocabularies: Word2vec has 3 million vectors, GloVe has 400,000, and fastText has 1 million. Because the vectors are learned once on these large datasets, a word keeps the same vector whatever corpus you apply it to.
5. They Have Multiple Embedding Models
Last but not least, you can download pre-trained models and use the word vectors directly. No need to generate them for each new corpus!
There are multiple types of pre-trained embedding models available online. A list is available on the gensim-data repository. Another powerful word embedding library is spaCy.
The main differences between the models are:
Their creation process: word2vec, GloVe and fastText are trained in different ways (we will explore this further in the next chapter).
The different vector sizes: these are arbitrarily set prior to training the model.
The nature of their training data and the vocabulary it holds.
Now that you understand the power of word embeddings, let’s take a closer look at some of the pre-trained models from the gensim-data repository!
The Functionalities of Gensim
Gensim and spaCy are two major Python libraries that work with pre-trained word embeddings. Since we have already explored spaCy in previous parts, let's experiment with Gensim.
Gensim can be installed with either pip or conda:
pip install --upgrade gensim
conda install -c conda-forge gensim
Gensim's documentation is less appealing than that of NLTK or spaCy, but the library is a fundamental component of the NLP toolkit.
Gensim allows you to work with the several types of embeddings mentioned earlier straight out of the box. In this chapter, we are going to focus on the original 1.7 GB word2vec model: word2vec-google-news-300.
Load it in Gensim with:
import gensim.downloader as api
model = api.load("word2vec-google-news-300")
Let's start with three Gensim functions:
model[word] to get the actual word vector.
most_similar for a list of words that are most similar to a given word.
similarity to compute a similarity score between two words.
model['book'] returns a vector with 300 elements:
# First 10 elements of the book vector
model['book'][:10]
> array([ 0.11279297, -0.02612305, -0.04492188,  0.06982422,  0.140625  ,
0.03039551, -0.04370117,  0.24511719,  0.08740234, -0.05053711],
dtype=float32)
The most_similar() function returns the 10 most similar words and their similarity scores:
model.most_similar("book")
> [('tome', 0.7485830783843994),
('books', 0.7379178404808044),
('memoir', 0.7302927374839783),
('paperback_edition', 0.6868364810943604),
('autobiography', 0.6741527915000916),
('memoirs', 0.6505153179168701),
('Book', 0.6479282379150391),
('paperback', 0.6471226811408997),
('novels', 0.6341458559036255),
('hardback', 0.6283079385757446)]
model.most_similar("apple")
> [('apples', 0.7203598022460938),
('pear', 0.6450696587562561),
('fruit', 0.6410146355628967),
('berry', 0.6302294731140137),
('pears', 0.6133961081504822),
('strawberry', 0.6058261394500732),
('peach', 0.6025873422622681),
('potato', 0.596093475818634),
('grape', 0.5935864448547363),
('blueberry', 0.5866668224334717)]
The similarity() function calculates the cosine similarity score between two words:
model.similarity("apple", "banana")
> 0.5318406
model.similarity("apple", "dog")
> 0.21969718
model.similarity("cat", "dog")
> 0.76094574
According to word2vec, a cat is more similar to a dog than it is to an apple (makes sense). Interestingly, cat and dog are also more similar than apple and banana!
The Word2vec Vocabulary
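You can explore examples of the word2vec vocabulary with model.key_to_index (named model.vocab in Gensim versions before 4.0), which returns a dictionary of tokens. The words are the dictionary keys, and their values are the index of the word in the Gensim model. You can randomly sample five tokens from the model's vocabulary with:
import numpy as np
# key_to_index replaces model.vocab in Gensim 4.0 and later
# np.random.choice needs a list (or array), not a dictionary view
vocab = list(model.key_to_index.keys())
np.random.choice(vocab, 5)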
Execute these lines a few times. You get tokens such as:
['Vancouver_Canucks_goaltender', 'eSound', 'DLLs', '&A;', 'Rawdha']
['Hodeidah', 'Cheatum', 'Mbanderu', 'common_equityholders', 'microfabricated']
['Dataflow', 'Ty_Ballou', 'Scott_RUFFRAGE', 'prawn_dish', 'offering']
The word2vec vocabulary is not only composed of regular words. It also contains:
Bigrams and trigrams: common_equityholders, prawn_dish, Vancouver_Canucks_goaltender
Proper nouns: Scott_RUFFRAGE or Ty_Ballou
Specific character sequences: &A
Take it Further: Calculate Similarity Between Words
Remember how we talked about the similarity between words? Are you curious about how it's calculated? Let's explore.
Similarity between words is calculated using a metric called cosine similarity.
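For two vectors $\vec{A} = [a_1, a_2, \cdots, a_n]$ and $\vec{B} = [b_1, b_2, \cdots, b_n]$, cosine similarity is defined by:
$$\cos(\vec{A}, \vec{B}) = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\|_2 \, \|\vec{B}\|_2}$$
where:
$\|\vec{A}\|_2 = \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}$ is the L2 norm of the vector $\vec{A}$.
$\vec{A} \cdot \vec{B} = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$ is the dot product, i.e., the sum of the products of each element of $\vec{A}$ with the corresponding element of $\vec{B}$.
For instance, if $\vec{A}$ and $\vec{B}$ are 3D vectors:
$$\cos(\vec{A}, \vec{B}) = \frac{a_1 b_1 + a_2 b_2 + a_3 b_3}{\sqrt{a_1^2 + a_2^2 + a_3^2} \, \sqrt{b_1^2 + b_2^2 + b_3^2}}$$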
Cosine similarity is not the only metric you can use to calculate similarity between words. However, it is the most common one when working with word embeddings.
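The cosine similarity definition translates almost directly into code. Here is a minimal sketch in plain Python. The function name cosine_similarity and the three example vectors are chosen for illustration only, and the sketch is not how Gensim computes its scores internally.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked 3D vectors, for illustration.
a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]   # same direction as a, twice as long
c = [2.0, -1.0, 0.0]  # perpendicular to a

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
```

The first result shows that cosine similarity ignores vector length: only the direction counts, so a vector and its scaled copy get the maximum score of 1.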
The Shortcomings of Word Embeddings
As mentioned before, these models depend on the data they were trained on, which has two significant side effects: cultural bias and out-of-vocabulary (OOV) issues.
The word2vec model was trained on a massive U.S. Google News corpus. It learned the relationships between words on the news as seen by Google in the U.S. There's nothing inherently biased about news in the U.S. versus some other corpus from another part of the world, but training on such a dataset means that the model inherits a certain dose of cultural bias.
One funny example happens to be with my own first name, Alexis. In the U.S., Alexis is a feminine name, whereas, in the rest of the world (as far as I know), it's a masculine name. You can see the U.S. bias by looking at words most similar to Alexis, according to word2vec:
model.most_similar('Alexis')
> ('Nicole', 0.7480324506759644),
('Erica', 0.745315432548523),
('Marissa', 0.7406104207038879),
('Alicia', 0.7322115898132324),
('Jessica', 0.7315416932106018),
....
These are mostly feminine names. It is not a big deal, but it is a good illustration of the inherent cultural learnings of the word2vec model. For more critical issues, you should be aware that these models are not universal or neutral but directly influenced by the corpus they were trained on.
Out-of-Vocabulary Words (OOV)
Another issue with GloVe and word2vec is the finite nature of the model's vocabulary.
You have seen that the word2vec list of known tokens is heterogeneous and complex. But even with over 3 million entries, there are certain words it can't identify, that is, words for which there is no associated vector. For example:
Covid (in all its cases and with or without "-19")
Word2vec (yes, word2vec does not know about its own existence).
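There are different strategies to handle out-of-vocabulary words. The simplest one is to return a vector of zeros for unknown words.
import numpy as np

def get_vector(word):  # illustrative helper name
    try:
        return model[word]
    except KeyError:
        # Unknown word: zero vector with the model's dimension
        return np.zeros(model.vector_size)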
Another possibility is to use the pre-trained Word2vec model and your own dataset to continue training the model. Out-of-vocabulary words that are present in your dataset will end up with their own vector representations. The process is called fine-tuning or transfer learning and is particularly helpful when working on domain-specific corpora (healthcare, biomedical, law, or astronomy).
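Whichever strategy you pick, it helps to know how many of your tokens are affected. The sketch below splits a token list into known and unknown words. A plain dictionary stands in for the embedding model (an assumption made so the example runs without downloading anything), and split_known_unknown is a hypothetical helper name. Gensim's keyed vectors also support the in operator, so the same function should work when passed the loaded model.

```python
def split_known_unknown(tokens, vectors):
    """Separate tokens that have a vector from those that do not."""
    known = [token for token in tokens if token in vectors]
    unknown = [token for token in tokens if token not in vectors]
    return known, unknown

# A tiny stand-in for the embedding model, with made-up 2D vectors.
toy_model = {"book": [0.1, 0.2], "apple": [0.3, 0.1], "dog": [0.7, 0.5]}

tokens = ["book", "Covid", "dog", "Word2vec"]
known, unknown = split_known_unknown(tokens, toy_model)

print(unknown)                     # ['Covid', 'Word2vec']
print(len(unknown) / len(tokens))  # 0.5
```

A high share of unknown tokens is a hint that zero vectors will throw away too much information and that fine-tuning on your own data may be worth the effort.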
Three widely used word embedding techniques are word2vec, GloVe, and fastText.
Word embeddings help overcome the inherent shortcomings of the bag-of-words approach.
They capture the semantic relationship between words.
They are dense vectors, meaning that almost all their values are non-zero, so more information is given to the model.
The vector size is constant and independent of the number of documents in your corpus.
Vector representations are also independent of the nature and content of the corpus.
You can download pre-trained models from the gensim-data repository.
Word embeddings have their own set of shortcomings:
They carry cultural bias from the training dataset.
There are certain words that they cannot identify. These are called out-of-vocabulary words.
In the next chapter, we will dive deeper into the inner workings of word2vec, GloVe, and fastText!