Last updated on 3/4/22

## Compare Embedding Models

Embeddings come in different flavors. In this chapter, we'll look at the differences between word2vec, GloVe, and fastText.

Let's see how to generate a word2vec model.
Consider the following sentences ending with either the word cute or scary.

• Rabbits are cute

• Cats are cute

• Puppies are cute

• Bees are scary

• Spiders are scary

• Worms are scary

You could train a (sort of) linear regression based on the animals in these sentences to find the probability that a sentence ends with cute.

Similarly, you could train a second linear regression to find the probability that a sentence ends with the word scary.

Now, imagine doing the same regressions over a very large volume of sentences and training them to predict the ending word (the target word) of each sentence.

Each target word is associated with as many coefficients as there are words in the overall vocabulary. The set of coefficients associated with each target word is its word vector (otherwise known as its embedding). Note that we only care about the coefficients associated with each target word and not the regression itself.
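To make the idea concrete, here is a minimal sketch in plain Python. It assumes a logistic regression (one reading of the "sort of" linear regression above) with one input per animal, trained by simple gradient steps. The function names, learning rate, and number of epochs are illustrative choices, not part of word2vec itself.

```python
import math

# The six toy sentences, reduced to (animal, ending word).
sentences = [
    ("rabbits", "cute"), ("cats", "cute"), ("puppies", "cute"),
    ("bees", "scary"), ("spiders", "scary"), ("worms", "scary"),
]
animals = sorted({animal for animal, _ in sentences})


def one_hot(animal):
    """Encode an animal as a vector with a single 1."""
    return [1.0 if animal == a else 0.0 for a in animals]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def probability(weights, animal):
    """Predicted probability that a sentence about `animal` ends with the target word."""
    return sigmoid(sum(w * x for w, x in zip(weights, one_hot(animal))))


def train(target, epochs=500, lr=0.5):
    """Fit one coefficient per animal for the given target word (hypothetical settings)."""
    weights = [0.0] * len(animals)
    for _ in range(epochs):
        for animal, ending in sentences:
            x = one_hot(animal)
            y = 1.0 if ending == target else 0.0
            p = probability(weights, animal)
            # Gradient step on the log loss.
            weights = [w + lr * (y - p) * xi for w, xi in zip(weights, x)]
    return weights


# One coefficient vector per target word: a toy stand-in for an embedding.
cute_vector = train("cute")
scary_vector = train("scary")
print(dict(zip(animals, (round(w, 1) for w in cute_vector))))
```

Each target word ends up with one coefficient per animal, so `cute_vector` and `scary_vector` play the role of tiny word vectors. Only these coefficients are kept; the regressions themselves are discarded.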

The above example is a very high-level overview of the real training process, but the idea is very similar.

Sliding a window of n words over a sentence generates multiple examples of target words and their context.

Here's an example of some of the sequences generated by sliding a window with a context size of two words over the sentence "It's a warm summer evening in ancient Greece."
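As a rough sketch, the window can be reproduced in a few lines of Python. This assumes that a context size of two means two words on each side of the target, and it uses a naive whitespace tokenizer; `context_windows` is a hypothetical helper, not a word2vec API.

```python
def context_windows(sentence, context_size=2):
    """Return (context words, target word) pairs for every position in the sentence."""
    # Naive tokenization: lowercase, drop the final period, split on spaces.
    tokens = sentence.lower().rstrip(".").split()
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - context_size):i]
        right = tokens[i + 1:i + 1 + context_size]
        pairs.append((left + right, target))
    return pairs


pairs = context_windows("It's a warm summer evening in ancient Greece.")
for context, target in pairs[:4]:
    print(target, "<-", context)
```

Words near the start or end of the sentence simply get a shorter context, since the window cannot extend past the sentence boundaries.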

In this case, word2vec is trained to predict words based on context, as similar words will tend to have similar contexts.

Diving deeper into the exact training mechanism and architecture of word2vec is beyond the scope of this course, but if you want to learn more about it, here are three excellent and well-illustrated articles on the subject:

We've mentioned two other types of embedding models: GloVe and fastText. Let's take a closer look at how they are generated.

GloVe stands for Global Vectors. It is an open source project from Stanford University.

As you now know, the main by-product of word2vec is its ability to encode the meaning of words and, more precisely, to support vector arithmetic such as queen - woman = king - man.

GloVe starts from a different signal: how often each word appears in the same context as other words. These contextual words are called probe words. For instance, pizza and burgers are more likely to be used in the same context than pizza and pavement, sword, or factory. The probe words enable you to define the variety of contexts that the main word is used in.

Let's measure the proximity of two words, ice and steam, to the probe word solid. The exact numbers, calculated over a huge corpus of over 6 billion words, are in the table below. You have:

• p(solid | ice) = number of times ice and solid are close by / number of times ice is present in the corpus = 1.9 × 10^-4.

• p(solid | steam) = number of times steam and solid are close by / number of times steam is present in the corpus = 2.2 × 10^-5.

The ratio of these two measures (8.9) gives you a direct indication of the semantic proximity of ice and steam to the probe word solid. With a ratio of 8.9, you can conclude that the word solid is closer to ice than it is to steam. Sounds legit.

Similarly, let's now take water as the probe word.

• p(water | ice) = number of times ice and water are close by / number of times ice is present in the corpus = 3.0 × 10^-3.

• p(water | steam) = number of times steam and water are close by / number of times steam is present in the corpus = 2.2 × 10^-3.

The ratio of these two numbers measures the proximity of ice and steam to the word water. Here you get 1.36, a number close to 1, which indicates that water is about as close to ice as it is to steam. It makes sense.
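The two ratios can be checked with a few lines of Python. This is only a sketch of the arithmetic: it takes the probability of solid near steam to be 2.2 × 10^-5, the value consistent with a ratio close to 8.9, and the dictionary layout and function name are illustrative.

```python
# Co-occurrence probabilities quoted in the text, keyed by (probe word, main word).
# Assumption: 2.2e-5 for (solid, steam), the value consistent with a ratio near 8.9.
probabilities = {
    ("solid", "ice"): 1.9e-4,
    ("solid", "steam"): 2.2e-5,
    ("water", "ice"): 3.0e-3,
    ("water", "steam"): 2.2e-3,
}


def ice_to_steam_ratio(probe):
    """Ratio of the probe's probability near ice to its probability near steam."""
    return probabilities[(probe, "ice")] / probabilities[(probe, "steam")]


for probe in ("solid", "water"):
    print(f"{probe}: {ice_to_steam_ratio(probe):.2f}")
```

With these rounded inputs, the ratio for solid comes out at about 8.6 rather than 8.9. The quoted 8.9 was presumably computed from the unrounded counts, so the small gap does not change the conclusion.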

These relationships can also be illustrated by drawing the vector difference between the original words (ice and steam) and the probe words solid and gas. You see that solid is closer to ice than it is to steam. Therefore, d(solid, ice) < d(solid, steam).

Now imagine doing the same computations for every pair of words in your vocabulary against every possible probe word. In the end, you would get a very fine-grained measure of the proximity of each pair of words to many different concepts. Since that would involve a huge matrix, you can limit contextual words to a window of n words to the left and right of the main word.
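Here is a minimal sketch of that counting step on a toy corpus. It follows the simplified definition used in this chapter (co-occurrences within a window, divided by the number of times the main word appears), not the exact normalization of the GloVe implementation. The corpus, the window size of three, and the function names are all made up for illustration.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, window=3):
    """Count how often each word appears within `window` positions of every other word."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[word][tokens[j]] += 1
    return counts


# Toy corpus (an assumption for illustration, far too small for real training).
tokens = "solid ice melts into water and water boils into steam".split()
counts = cooccurrence_counts(tokens, window=3)
word_counts = Counter(tokens)


def probe_probability(probe, word):
    """Times `probe` is close to `word`, divided by the times `word` appears."""
    return counts[word][probe] / word_counts[word]


for probe in ("solid", "water"):
    print(probe, probe_probability(probe, "ice"), probe_probability(probe, "steam"))
```

Even on this tiny corpus, solid appears near ice but never near steam, while water appears near both, which mirrors the pattern in the real figures.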

The next steps to get GloVe embeddings are more mathematically involved and beyond the scope of this course. In short, the authors define a loss function that works on the log of these ratios: because the log of a ratio is a difference of logs, ratios of co-occurrence probabilities map onto differences between word vectors. That loss function can be minimized using classic stochastic gradient descent (SGD), a highly optimized algorithm. Enough said.

An article published by Aylien, entitled "An overview of word embeddings and their connection to distributional semantic models", explains the difference between GloVe and word2vec:

In contrast to word2vec, GloVe seeks to make explicit what word2vec does implicitly: Encoding meaning as vector offsets in an embedding space -- seemingly only a serendipitous by-product of word2vec -- is the specified goal of GloVe.

In practice, GloVe uses a highly optimized algorithm (SGD) during training, which converges even on a small corpus. The consequence is that, compared to word2vec, GloVe offers:

• Faster training.

• Better RAM/CPU efficiency (can handle larger documents).

• More efficient use of data (helps with smaller corpora).

• Better accuracy for the same amount of training.

To solve the issue of out-of-vocabulary (OOV) words, meaning words that were not seen during training, fastText uses a tokenization strategy based on character n-grams and whole words. A character n-gram is simply a sequence of n letters.

For instance, the word window will not only have its own vector, but it will also generate vectors for all its character bigrams (wi, in, nd, do, and ow) and trigrams (win, ind, ndo, and dow). Similarly, the word table will generate vectors for the bigrams ta, ab, bl, and le, and the trigrams tab, abl, and ble.
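The n-gram extraction itself is easy to sketch in Python. The helpers below are hypothetical and only reproduce the bigram and trigram examples above. The real fastText library also adds word-boundary markers and, by default, works with longer n-grams, which this sketch leaves out.

```python
def char_ngrams(word, n):
    """Return all contiguous character sequences of length n in the word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def subword_units(word, sizes=(2, 3)):
    """Return the whole word followed by its character n-grams for each size."""
    units = [word]
    for n in sizes:
        units.extend(char_ngrams(word, n))
    return units


print(char_ngrams("window", 2))
print(char_ngrams("window", 3))

# An unseen word still shares most of its n-grams with a known one.
shared = set(subword_units("windows")) & set(subword_units("window"))
print(sorted(shared))
```

This overlap is what lets fastText build a vector for a word it has never seen: the unseen word windows shares every bigram and trigram of window, so its vector can be assembled from n-gram vectors that were learned during training.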

Besides this flexible and granular tokenization strategy, fastText implements a rapid, straightforward word2vec training with either CBOW or skip-grams.

Here are some other key characteristics of fastText:

• It offers word vectors for nearly 300 languages (you can find these vectors in the fastText documentation).

• It is available in Python.

• It facilitates training a text classifier.

• It has a specific model for language detection (you can read more about language detection in the fastText documentation). This comes in very handy when filtering social network chaff in multiple languages.

In this chapter, we focused on the nature of three types of word embeddings: word2vec, GloVe, and fastText.

• Word2vec is obtained by using a neural network to predict missing words in sentences, and taking the coefficients of the last layer of the neural network as the elements of the word vector.

• GloVe focuses on capturing the similarity in context between words. It is lighter and more efficient than word2vec.

• FastText works with sub-word tokenization and, as a consequence, can handle out-of-vocabulary words.

Now that you better understand the inner workings of word2vec, GloVe, and fastText, let's take these models for a spin by creating embeddings on a specific corpus.