
Last updated on 1/28/21

Train Your First Embedding Models


Set Up Your Environment

To train your first model, we'll use the Shakespeare corpus, composed of all the lines from all the Shakespeare plays available on Kaggle (or here). The idea behind working on classic literature is not to be all snobbish, but to find a corpus that is different enough from the ones word2vec and GloVe were trained on (that is, Google News and Wikipedia). We expect the Shakespeare dataset to offer a different view of the world, with a different vocabulary. The dataset is also large and already in a short-sequence format, which speeds up the sequence creation.

Load the dataset with:

import urllib.request
import re
# change to your own path if you have downloaded the file locally
url = 'https://dataskat.s3.eu-west-3.amazonaws.com/data/Shakespeare_alllines.txt'
# read the file into a list of lines
lines = urllib.request.urlopen(url).read().decode('utf-8').split("\n")

Remove all punctuation and tokenize with:

sentences = []
for line in lines:
    # remove punctuation
    line = re.sub(r'[\!"#$%&\*+,-./:;<=>?@^_`()|~=]', '', line).strip()
    # tokenize
    tokens = re.findall(r'\b\w+\b', line)
    if len(tokens) > 1:
        sentences.append(tokens)
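
You can quickly sanity-check the preprocessing before training; the exact counts depend on the version of the file, so treat them as indicative:

# quick sanity check of the preprocessing
print(len(lines))       # number of raw lines read from the file
print(len(sentences))   # number of tokenized sentences kept
print(sentences[0])     # first tokenized sentence (a list of words)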

Train a Word2vec Model

Let's start by training a word2vec model with Gensim, which comes down to instantiating it with the proper parameters:

  • min_count: ignores words that appear fewer times than this number.

  • size: the dimension of the embeddings. Let's choose 50.

  • window: the size of the window around each target word.

Another important parameter, sg, determines whether to use CBOW or Skip-Grams as the training strategy. We'll use Skip-Grams (sg=1).

We'll call our model bard2vec. Get it? Shakespeare? The Bard? Okay, moving on. 

from gensim.models import Word2Vec

bard2vec = Word2Vec(
    sentences,
    min_count=3,   # ignore words that appear less than this
    size=50,       # dimensionality of the word embeddings
    sg=1,          # use Skip-Grams
    window=7,      # context window around each target word
    iter=40)       # number of training epochs over the corpus

The training is pretty fast. We can explore our new model by looking at some similar words. Here are a few examples that give some insight into a Shakespearian view of the world: King, sword, husband, and Hamlet, of course.

At this point, feel free to experiment with the parameters of the word2vec model and check other words. Use bard2vec.wv.most_similar(word) to get the list of similar words:
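
For instance, the results below were obtained with calls like these (the exact neighbors vary from one training run to another):

bard2vec.wv.most_similar('King')
bard2vec.wv.most_similar('sword')
bard2vec.wv.most_similar('husband')
bard2vec.wv.most_similar('Hamlet')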

most_similar('King'): Henry, Pepin, Richard, Edward, England, Pericles, Leontes, whereas, Fifth, hearse

most_similar('sword'): scimitar, head, knife, dagger, rapier, hand, sleeve, scabbard, burgonet, Bringing

most_similar('husband'): wife, mistress, son, mother, daughter, master, father, brother, Katharina, puppy

most_similar('Hamlet'): cousin, chuck, gaoler, Gertrude, Mercutio, sentence, Fenton, Escalus, Stanley, Advancing

As you can see, you end up with words that are all relevant to Shakespeare's plays and era. If you train the model with different parameters, you will end up with different results.
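
For example, a hypothetical variation like the one below switches to CBOW (sg=0) with a narrower window; the neighbor lists it produces will differ from the ones above:

# a hypothetical variation for experimentation: CBOW (sg=0) with a narrower window
bard2vec_cbow = Word2Vec(
    sentences,
    min_count=3,
    size=50,
    sg=0,        # CBOW instead of Skip-Grams
    window=3,    # narrower context window
    iter=40)
bard2vec_cbow.wv.most_similar('King')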

Train a GloVe Model

The ability to train GloVe models is not included in either Gensim or spaCy, but there are several implementations in Python. For our purposes, we will use the glove-python library. Although a bit old, it is robust enough for our goal.

The library can be installed with pip install glove_python:

from glove import Corpus, Glove

# instantiate the corpus object
corpus = Corpus()
# build the word co-occurrence matrix
corpus.fit(sentences, window=10)
# instantiate the model
glove = Glove(no_components=50, learning_rate=0.05)
# and fit it over the corpus matrix
glove.fit(corpus.matrix, epochs=30, no_threads=2)
# finally, add the vocabulary to the model
glove.add_dictionary(corpus.dictionary)

In the code above, the model has a vector size of 50 and a context window of 10 words. The learning rate is set to 0.05 and dictates the convergence speed and accuracy of the SGD algorithm.

Similarly to the word2vec model, we can check the similarity of different words with glove.most_similar(word, number=10).
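
For example:

glove.most_similar('King', number=10)
glove.most_similar('sword', number=10)
glove.most_similar('husband', number=10)
glove.most_similar('Hamlet', number=10)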

Here is the result:

most_similar('King'): Duke, kings, image, front, instruments, York, Earl, Prince, senators

most_similar('sword'): head, face, grave, soul, tongue, mind, daughter, horse, body

most_similar('husband'): father, wife, daughter, mother, brother, tongue, master, mistress, mind

most_similar('Hamlet'): Angelo, coast, Lord, Antony, woful, AEneas, monstrous, Timon, where's

Did you notice that the list of similar words is different between GloVe and word2vec? How do you know which type of embedding is best?

Evaluate Your Embedding

There are two ways to evaluate embeddings: intrinsic and extrinsic.

Intrinsic evaluation involves comparing the embedding to a reference: for instance, comparing word similarities to those in an existing lexical database, such as WordNet. You can also manually annotate the embedding results, but this takes time and human resources.
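
As a minimal sketch of the intrinsic approach, assuming you have NLTK installed and its WordNet data downloaded (neither is part of this course's setup), you could compare the model's cosine similarities with WordNet's path similarities on a few word pairs:

# minimal intrinsic-evaluation sketch
# assumes: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

pairs = [('king', 'prince'), ('sword', 'dagger'), ('husband', 'wife')]
for w1, w2 in pairs:
    if w1 not in bard2vec.wv or w2 not in bard2vec.wv:
        continue  # skip pairs missing from the model's vocabulary
    model_sim = bard2vec.wv.similarity(w1, w2)      # cosine similarity in our embedding
    s1, s2 = wn.synsets(w1)[0], wn.synsets(w2)[0]   # first WordNet sense of each word
    wordnet_sim = s1.path_similarity(s2)            # similarity according to WordNet
    print(w1, w2, round(float(model_sim), 3), wordnet_sim)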

Extrinsic evaluation consists of evaluating the model's performance on a downstream task such as text classification, machine translation, or summarization. The downstream task has its own performance evaluation strategy, which gives you insight into the embedding model's quality.
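
Here is a rough sketch of the extrinsic approach. The task, the labels, and the scikit-learn pipeline below are all hypothetical and only illustrate the shape of such an evaluation: each sentence is turned into the average of its word vectors and fed to a classifier, whose cross-validated score would then reflect the quality of the embedding.

# rough extrinsic-evaluation sketch: use the embeddings as features for a
# (hypothetical) downstream classification task and score that task
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def doc_vector(tokens, model):
    # average the vectors of the tokens the model knows; zeros if none are known
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.wv.vector_size)

# X: one averaged vector per sentence
# y: placeholder labels, for illustration only; a real task would have
#    meaningful labels (e.g. which play or genre a line belongs to)
X = np.array([doc_vector(tokens, bard2vec) for tokens in sentences])
y = np.random.randint(0, 2, size=len(X))

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=3).mean())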

For this course's purpose, we're going to leave the discussion about word evaluations here, but I wanted to give you a brief glimpse before rounding out the course.

Let's Recap!

  • We trained a word2vec model from scratch with Gensim on a Shakespeare corpus.

  • We also trained a GloVe model on the same corpus and observed that each model gives similar, but distinct, results in terms of word similarity.

  • Finally, you learned that embedding models can be evaluated intrinsically or extrinsically.

    • Intrinsic: The embedding is compared to a reference model (e.g., a lexical database).

    • Extrinsic: The model is evaluated with a downstream task such as classification, machine translation, or summarization. 

That's a Wrap!

We are at the end of the course! You've learned how to preprocess text and transform it into vectors using bag-of-words and word embeddings. You also got to apply that text vectorization to text classification, sentiment analysis, and unsupervised exploration! It was great fun to write this course, and I hope you enjoyed it!

Natural language processing is innovating at breathtaking speed, and its applications have a very significant impact on our lives. With NLP, we're reaching for the infinite diversity, elegance, and power of human language. I hope this course gave you a taste of NLP and has motivated you to continue learning! And yes, there is plenty more to learn, so stay curious, keep practicing, and keep learning!

I am grateful to the OpenClassrooms team, whose ideas and support made a huge difference. Many thanks, folks!
