Set Up Your Environment
To train your first model, we'll use the Shakespeare corpus, composed of all the lines of all the Shakespeare plays available on Kaggle (or here). The idea behind working on classic literature is not to be all snobbish, but to find a corpus that is different enough from the ones word2vec and GloVe were trained on (that is, Google U.S. News and Wikipedia). We expect the Shakespeare dataset to have a different view of the world, with a different vocabulary. The dataset is also large and already in a short-sequence format, which will speed up the sequence creation.
Load the dataset with:
import urllib.request
import re

# change to your own path if you have downloaded the file locally
url = 'https://dataskat.s3.eu-west-3.amazonaws.com/data/Shakespeare_alllines.txt'

# read the file into a list of lines
lines = urllib.request.urlopen(url).read().decode('utf-8').split("\n")
Remove all punctuation and tokenize with:
sentences = []
for line in lines:
    # remove punctuation
    line = re.sub(r'[\!"#$%&\*+,-./:;<=>?@^_`()|~=]', '', line).strip()
    # tokenize on word boundaries
    tokens = re.findall(r'\b\w+\b', line)
    if len(tokens) > 1:
        sentences.append(tokens)
Train a Word2vec Model
Let's start by training a word2vec model with Gensim, which comes down to instantiating it with the proper parameters:
min_count: ignores all words that appear fewer times than this number.
size: the dimension of the embeddings. Let's choose 50.
window: the size of the window around each target word.
Another important parameter is sg, which determines whether to use CBOW or Skip-Grams as the training strategy. We'll use Skip-Grams (sg=1).
We'll call our model bard2vec. Get it? Shakespeare? The Bard? Okay, moving on.
from gensim.models import Word2Vec

bard2vec = Word2Vec(
    sentences,
    min_count=3,   # Ignore words that appear less than this
    size=50,       # Dimensionality of word embeddings
    sg=1,          # Use skip-grams
    window=7,      # Context window for words during training
    iter=40)       # Number of epochs training over corpus
The training is pretty fast. We can explore our new model by looking at some similar words. Here are a few examples that give some insight into a Shakespearian view of the world: King, sword, husband, and Hamlet, of course.
At this point, feel free to experiment with the parameters of the word2vec model and check other words. Use bard2vec.wv.most_similar(word) to get the list of similar words.
As you can see, you end up with all things relevant to Shakespeare's plays and era. If you train the model with different parameters, you will end up with different results.
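Under the hood, most_similar ranks the entire vocabulary by cosine similarity to the query word's vector. Here is a minimal NumPy sketch of that computation, using a hypothetical toy vocabulary and made-up vector values for illustration:

```python
import numpy as np

# Toy embedding matrix: 4 words, 3 dimensions (hypothetical values)
vocab = ["king", "queen", "sword", "apple"]
vectors = np.array([
    [0.90, 0.80, 0.10],   # king
    [0.85, 0.82, 0.05],   # queen
    [0.20, 0.10, 0.90],   # sword
    [-0.30, 0.40, -0.50], # apple
])

def most_similar(word, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    target = vectors[vocab.index(word)]
    # Normalize the rows so a dot product equals cosine similarity
    rows = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = rows @ (target / np.linalg.norm(target))
    ranked = sorted(zip(vocab, sims), key=lambda p: -p[1])
    # Drop the query word itself and keep the topn best matches
    return [(w, float(s)) for w, s in ranked if w != word][:topn]

print(most_similar("king"))  # "queen" comes out on top
```

Gensim's version does the same thing, just vectorized over a vocabulary of tens of thousands of words.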
Train a GloVe Model
The ability to train GloVe is not included in either Gensim or spaCy, but there are several implementations in Python. For our purposes, we will use the glove-python library. Although a bit old, it is robust enough for our goal.
The library can be installed with pip install glove_python:
from glove import Corpus, Glove

# instantiate the corpus
corpus = Corpus()

# this builds the word co-occurrence matrix
corpus.fit(sentences, window=10)

# instantiate the model
glove = Glove(no_components=50, learning_rate=0.05)

# and fit over the corpus matrix
glove.fit(corpus.matrix, epochs=30, no_threads=2)

# finally, add the vocabulary to the model
glove.add_dictionary(corpus.dictionary)
In the code above, the model has a vector size of 50 and a context window of 10 words. The learning rate, which dictates the convergence speed and accuracy of the SGD algorithm, is set to 0.05.
Similarly to the word2vec model, we can check the similarity of different words with glove.most_similar(word).
Did you notice that the list of similar words is different between GloVe and word2vec? How do you know which type of embedding is best?
Evaluate Your Embedding
There are two ways to evaluate embeddings: intrinsic and extrinsic.
Intrinsic evaluation involves comparing the embedding to a reference. For instance, you can compare word similarities produced by the model to an existing lexical database, such as WordNet. You can also manually annotate the embedding results, but this takes time and human resources.
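A common intrinsic metric is the Spearman rank correlation between the model's similarity scores and human similarity judgments (as collected in benchmarks such as WordSim-353). Here is a minimal sketch with made-up scores; the word pairs and numbers are hypothetical:

```python
import numpy as np

# Hypothetical human similarity judgments (0-10 scale) for four word pairs,
# e.g. (king, queen), (sword, blade), (king, apple), (sword, apple)
human_scores = [9.1, 7.5, 3.2, 1.0]

# Cosine similarities an embedding model assigned to the same pairs
model_scores = [0.95, 0.80, 0.40, 0.05]

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

print(spearman(human_scores, model_scores))  # 1.0: the rankings agree perfectly
```

The closer the correlation is to 1, the better the embedding matches human intuitions about similarity, regardless of the absolute scale of the scores.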
Extrinsic evaluation consists of evaluating the model's performance with regard to a downstream task such as text classification, machine translation, or summarization. The downstream task has its own performance evaluation strategy, which gives you insight into the embedding model's quality.
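To make the extrinsic idea concrete, here is a toy sketch: represent each document as the average of its word vectors, classify with a simple nearest-centroid rule, and use the task's accuracy as an indirect score for the embedding. The tiny vocabulary, labels, and vector values are all hypothetical:

```python
import numpy as np

# Hypothetical 2-D embeddings for a tiny vocabulary (illustrative values)
emb = {
    "king": np.array([0.9, 0.1]), "queen": np.array([0.8, 0.2]),
    "sword": np.array([0.1, 0.9]), "battle": np.array([0.2, 0.8]),
}

def doc_vector(tokens):
    """Represent a document as the average of its word embeddings."""
    return np.mean([emb[t] for t in tokens if t in emb], axis=0)

# Toy downstream task: classify lines as "court" talk vs "war" talk
train = [(["king", "queen"], "court"), (["sword", "battle"], "war")]
test = [(["queen", "king", "queen"], "court"), (["battle", "sword"], "war")]

# Nearest-centroid classifier over the averaged embeddings
centroids = {label: doc_vector(tokens) for tokens, label in train}

def predict(tokens):
    v = doc_vector(tokens)
    return min(centroids, key=lambda label: np.linalg.norm(v - centroids[label]))

# The downstream accuracy scores the embedding indirectly
accuracy = np.mean([predict(tokens) == label for tokens, label in test])
print(accuracy)
```

Swapping in a different embedding (word2vec vs. GloVe, or different hyperparameters) while keeping the task fixed lets you compare the embeddings by the accuracy they produce.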
For this course's purpose, we're going to leave the discussion about embedding evaluation here, but I wanted to give you a brief glimpse before rounding out the course.
We trained a word2vec model from scratch with Gensim on a Shakespeare corpus.
We also trained a GloVe model on the same corpus, and observed that the two models give similar, but distinct, results in terms of word similarity.
Finally, you learned that embeddings models can be evaluated intrinsically or extrinsically.
Intrinsic: The embedding is compared to a reference model (i.e., a lexical database).
Extrinsic: The model is evaluated with a downstream task such as classification, machine translation, or summarization.
That's a Wrap!
We are at the end of the course! You've learned how to preprocess text and transform it into vectors using bag-of-words and word embeddings. You also got to apply that text vectorization to text classification, sentiment analysis, and unsupervised exploration! It was great fun to write this course, and I hope you enjoyed it!
Natural language processing is innovating at breathtaking speed, and applications have very significant impacts on our lives. With NLP, we're reaching for infinite diversity, elegance, and power of the human language. I hope this course gave you a taste of NLP and has motivated you to continue learning more! And yes, there is plenty more to learn, so stay curious, keep practicing, and keep learning!
I am grateful to the OpenClassrooms team whose ideas and support made a huge difference. Many thanks, folks.