
Compare Embedding Models

Embeddings come in different flavors. In this chapter, we’ll look at the differences between word2vec, GloVe, and fastText.

Get to Know word2vec

Let’s see how to generate a word2vec model.

Consider the following sentences, each ending with either the word “cute” or the word “scary”:

  • Rabbits are cute

  • Cats are cute

  • Puppies are cute

  • Bees are scary

  • Spiders are scary

  • Worms are scary

You could train a (sort of) linear regression based on the animals in these sentences to find the probability that a sentence ends with “cute”:

$p(\text{sentence ends with cute}) = a_1 \times \text{Rabbits} + a_2 \times \text{Cats} + a_3 \times \text{Puppies} + a_4 \times \text{Bees} + a_5 \times \text{Spiders} + a_6 \times \text{Worms}$

Similarly, you could train a second linear regression to find the probability that the sentence ends with the word “scary”:

$p(\text{sentence ends with scary}) = b_1 \times \text{Rabbits} + b_2 \times \text{Cats} + b_3 \times \text{Puppies} + b_4 \times \text{Bees} + b_5 \times \text{Spiders} + b_6 \times \text{Worms}$

Now we can regroup the $a_i$ coefficients of the regression over “cute” into a vector: $[a_1, a_2, \dots, a_6]$. This vector is the embedding of the word “cute”. Likewise, the vector $[b_1, b_2, \dots, b_6]$ is the embedding of the word “scary”.
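To make this intuition concrete, here is a toy sketch (not how word2vec is actually trained): we one-hot encode the animal in each sentence and fit a scikit-learn logistic regression that predicts whether the sentence ends with “cute”. The choice of scikit-learn and the exact model are illustrative assumptions; the fitted coefficients simply play the role of the $a_i$ and $b_i$ above.

```python
# Toy illustration of the regression intuition above (not real word2vec training).
import numpy as np
from sklearn.linear_model import LogisticRegression

animals = ["Rabbits", "Cats", "Puppies", "Bees", "Spiders", "Worms"]
X = np.eye(len(animals))                # one-hot vector per sentence (one column per animal)
y_cute = np.array([1, 1, 1, 0, 0, 0])   # 1 if the sentence ends with "cute"

# Regression for "cute": its coefficients play the role of [a_1, ..., a_6]
cute_model = LogisticRegression().fit(X, y_cute)
cute_embedding = cute_model.coef_[0]

# Regression for "scary": its coefficients play the role of [b_1, ..., b_6]
scary_model = LogisticRegression().fit(X, 1 - y_cute)
scary_embedding = scary_model.coef_[0]

print(dict(zip(animals, np.round(cute_embedding, 2))))
print(dict(zip(animals, np.round(scary_embedding, 2))))
```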

Now imagine doing the same regression exercise over millions of sentences and all the words they include:

  1. First, pick a target word.

  2. For each sentence, calculate the coefficients of the regression of the target word against all the other words in the corpus (that’s a lot of coefficients).

  3. Average all the resulting vectors into a single vector.

  4. Then, reduce the vector size so that it no longer depends on the vocabulary size (dimension reduction).

  5. Obtain a universal vector for the initial target word.

  6. Rinse and repeat for all the words in the vocabulary!

The example above is a high-level overview; the real training process is more sophisticated, but the idea is very similar.

Sliding a window of n words over a sentence generates multiple examples of target words and their context.

Here’s an example of some of the sequences generated by sliding a window with a context size of two words over the sentence, “It’s a warm summer evening in ancient Greece.” (Google the phrase ;)) 

Figure: a 5-word sliding window over the sentence “It's a warm summer evening in ancient Greece”, showing target words with n=2 context words on each side

In this way, word2vec is trained to predict words from their context; since similar words tend to appear in similar contexts, they end up with similar vectors.
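Below is a minimal sketch of how such (target, context) pairs can be generated; the function name and the naive whitespace tokenization are just illustrative choices.

```python
# Generate (target, context) training pairs with a sliding window (context size n=2).
def context_pairs(tokens, n=2):
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to n words on each side of the target word.
        context = tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n]
        pairs.append((target, context))
    return pairs

sentence = "it's a warm summer evening in ancient greece".split()
for target, context in context_pairs(sentence):
    print(f"{target:>10} -> {context}")
```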

Diving deeper into the exact training mechanism and architecture of word2vec is beyond the scope of this course, but if you want to learn more about it, there are several excellent and well-illustrated articles on the subject online.
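If you would like to see word2vec in action right away, here is a minimal training sketch using the gensim library (an assumption on our side: gensim is not introduced in this chapter, and such a tiny corpus will not produce meaningful vectors; it only shows the API shape).

```python
# Minimal word2vec training sketch with gensim (illustrative only).
from gensim.models import Word2Vec

corpus = [
    ["rabbits", "are", "cute"],
    ["cats", "are", "cute"],
    ["puppies", "are", "cute"],
    ["bees", "are", "scary"],
    ["spiders", "are", "scary"],
    ["worms", "are", "scary"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=10,  # dimension of the word vectors
    window=2,        # context size on each side of the target word
    min_count=1,     # keep every word, even if it appears only once
    sg=1,            # 1 = skip-gram, 0 = CBOW
)

print(model.wv["cute"])               # the 10-dimensional vector for "cute"
print(model.wv.most_similar("cute"))  # nearest words in the embedding space
```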

We’ve mentioned two other types of embedding models: GloVe and fastText. Let’s take a closer look at how they are generated.

Get to Know GloVe

GloVe stands for Global Vectors. It is an open-source project from Stanford University.

As you now know, the main by-product of word2vec is its ability to encode the meaning of words and, more precisely, to support vector arithmetic such as queen − woman ≈ king − man.

GloVe takes a more global approach: it looks at how often each word co-occurs with a set of contextual words across the entire corpus. These contextual words are called probe words. For instance, pizza and burgers are more likely to be used in the same context (meal, restaurant, etc.) than pizza and pavement or burger and bebop.

By measuring how close each word is to the set of probe words, you can derive a vector per word.
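To make the probe-word idea concrete, here is a toy sketch that simply counts how often each word appears near a handful of hand-picked probe words. The probe words, the mini corpus, and the window size are all made up for illustration; real GloVe derives its vectors from such global co-occurrence statistics with a much more involved weighted factorization.

```python
# Toy co-occurrence counting against a few probe words (illustration only).
from collections import Counter, defaultdict

probe_words = ["meal", "restaurant", "music"]
corpus = [
    "we shared a pizza at the restaurant".split(),
    "that burger was a great meal".split(),
    "bebop is upbeat jazz music".split(),
]

window = 4
cooc = defaultdict(Counter)
for tokens in corpus:
    for i, word in enumerate(tokens):
        neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        for probe in probe_words:
            cooc[word][probe] += neighbours.count(probe)

# Count vectors against the probe words. With a real corpus, "pizza" and "burger"
# would accumulate similar counts against food-related probes, unlike "bebop".
for word in ["pizza", "burger", "bebop"]:
    print(word, [cooc[word][probe] for probe in probe_words])
```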

The inner calculations of deriving GloVe embeddings are quite mathematically involved and beyond the scope of this course. ;)

Get to Know fastText

To solve the issue of out-of-vocabulary (OOV) words, fastText uses a tokenization strategy based on whole words as well as substrings of 2 or 3 consecutive letters, called character n-grams.

For instance, the word window has the bigrams (2-grams) wi, in, nd, do, and ow, and the trigrams (3-grams) win, ind, ndo, and dow. The word table has the bigrams ta, ab, bl, and le, and the trigrams tab, abl, and ble. fastText generates vectors for the words as well as for their bigrams and trigrams.
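Here is a quick sketch of how those character n-grams can be generated. Note that real fastText also wraps each word in “<” and “>” boundary markers and uses a configurable range of n-gram lengths, which this sketch skips.

```python
# Character n-grams of a word, in the spirit of fastText's subword tokenization.
# (Real fastText also adds "<" and ">" word-boundary markers and uses a
# configurable range of n-gram lengths; this sketch keeps plain 2- and 3-grams.)
def char_ngrams(word, n):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

for word in ["window", "table"]:
    print(word, "bigrams:", char_ngrams(word, 2))
    print(word, "trigrams:", char_ngrams(word, 3))
```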

Besides this flexible and granular tokenization strategy, fastText implements rapid, straightforward word2vec training.

Here are some other key characteristics of fastText:

  • It offers word vectors for nearly 300 languages. You can find these vectors in the fastText documentation.

  • It has a specific model for language detection (read more about language detection in the fastText documentation). This comes in very handy when filtering social network chaff in multiple languages (see the sketch after this list). 🙂
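For instance, here is a minimal language-detection sketch with the fasttext Python package. It assumes you have installed the package and downloaded the pre-trained lid.176.bin model from the fastText site beforehand.

```python
# Language detection with fastText's pre-trained model (assumes the fasttext
# package is installed and lid.176.bin has been downloaded locally).
import fasttext

model = fasttext.load_model("lid.176.bin")

labels, probabilities = model.predict("Les spectacles de rue sont gratuits ce soir")
print(labels, probabilities)  # e.g. ('__label__fr',) with its confidence score
```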

Let’s Recap!

In this chapter, we focused on the nature of three types of word embeddings: word2vec, GloVe, and fastText.

  • word2vec uses a neural network to predict target words from their context and takes the learned weights of the network as the elements of the word vectors.

  • GloVe focuses on capturing the similarity in context between words. It is lighter and more efficient than word2vec.

  • fastText works with n-grams tokenization and, consequently, can handle out-of-vocabulary words.

Now that you better understand the inner workings of word2vec, GloVe, and fastText, let’s take these models for a spin by creating embeddings on a specific corpus. 
