Set Up Your Environment
To train your first model, we’ll use the Shakespeare corpus, composed of all the lines from the Shakespeare plays available on Kaggle. The idea behind working on classic literature is not to be snobbish, but to find a corpus different enough from the ones word2vec and GloVe were trained on (Google News and Wikipedia). We expect the Shakespeare dataset to have a different worldview and vocabulary. The dataset is also large and already in a short-sequence format, which speeds up training.
Load the dataset using the following code:
import urllib.request
import re
# change to your own path if you have downloaded the file locally
url = 'https://raw.githubusercontent.com/alexisperrier/intro2nlp/master/data/Shakespeare_alllines.txt'
# read file into list of lines
lines = urllib.request.urlopen(url).read().decode('utf-8').split("\n")
Remove all punctuation and tokenize with the following:
sentences = []
for line in lines:
    # remove punctuation
    line = re.sub(r'[\!"#$%&\*+,-./:;<=>?@^_`()|~=]', '', line).strip()
    # simple tokenizer
    tokens = re.findall(r'\b\w+\b', line)
    # only keep lines with more than one token
    if len(tokens) > 1:
        sentences.append(tokens)
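To sanity-check the preprocessing, you can run the same two regexes on a single line. The sample sentence below is just an illustration:

```python
import re

line = 'To be, or not to be: that is the question!'
# same punctuation-stripping pattern as above
line = re.sub(r'[\!"#$%&\*+,-./:;<=>?@^_`()|~=]', '', line).strip()
# same simple word tokenizer as above
tokens = re.findall(r'\b\w+\b', line)
print(tokens)
# ['To', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
```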
Train a word2vec Model
Let’s start by training a word2vec model with Gensim using the following parameters:
- min_count: ignores words that appear fewer times than this threshold.
- vector_size: the dimensionality of the embeddings. Let’s choose 50.
- window: the size of the context window around each target word. We’ll use a window size of 7.
- sg: another important parameter; it determines whether to use CBOW (sg=0) or skip-grams (sg=1) as the training strategy. We’ll use skip-grams (sg=1).
We’ll call our model bard2vec. Get it? Shakespeare? The Bard? Okay, moving on.
from gensim.models import Word2Vec

bard2vec = Word2Vec(
    sentences,
    min_count=3,     # ignore words that appear fewer than 3 times
    vector_size=50,  # dimensionality of the word embeddings
    sg=1,            # use skip-grams
    window=7,        # context window around each target word
    epochs=40)       # number of training epochs over the corpus
The training is pretty fast. We can explore our new model by looking at some similar words. Here are a few examples that give some insight into a Shakespearian view of the world: King, sword, husband, and Hamlet, of course.
At this point, feel free to experiment with the parameters of the word2vec model and check other words. Use bard2vec.wv.most_similar(word) to get the list of similar words:
| most_similar('King') | most_similar('sword') | most_similar('husband') | most_similar('Hamlet') |
|---|---|---|---|
| Henry | scimitar | wife | cousin |
| Pepin | head | mistress | chuck |
| Richard | knife | son | gaoler |
| Edward | dagger | mother | Gertrude |
| England | rapier | daughter | Mercutio |
| Pericles | hand | master | sentence |
| Leontes | sleeve | father | Fenton |
| whereas | scabbard | brother | Escalus |
| Fifth | burgonet | Katharina | Stanley |
| hearse | Bringing | puppy | Advancing |
As you can see, you end up with all things relevant to Shakespeare’s plays and era. If you train the model with different parameters, you will get different results.
For instance, if you use the following parameters for your model, the word similarity results will be quite different:
from gensim.models import Word2Vec

bard2vec = Word2Vec(
    sentences,
    min_count=3,     # same
    vector_size=50,  # same
    sg=0,            # CBOW instead of skip-grams
    window=10,       # larger context window
    epochs=100)      # longer training
| most_similar('King') | most_similar('sword') | most_similar('husband') | most_similar('Hamlet') |
|---|---|---|---|
| title | head | wife | Canterbury |
| vial | rapier | mistress | Northumberland |
| Gaunt | weapon | mother | Clifford |
| Edward | knife | son | Fortinbras |
| Burgundy | finger | father | Gloucester |
| Queen | heart | sister | Gertrude |
| king | dagger | brother | Margaret |
| Arthur | horse | daughter | York |
| Scotland | face | master | Goneril |
| corse | foot | friend | Cressid |
Evaluate Your Embedding
There are two ways to evaluate embeddings: intrinsic and extrinsic.
Intrinsic evaluation involves comparing the embedding to a reference model. For instance, you can compare word similarities to an existing lexical database, such as WordNet. You can also manually annotate the embedding results, but this takes time and human resources.
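As a concrete toy illustration of intrinsic evaluation, you can rank-correlate the model’s similarity scores with human similarity ratings for the same word pairs. The ratings and scores below are made up for illustration; real benchmarks like WordSim-353 work the same way:

```python
def spearman(xs, ys):
    """Spearman rank correlation between two equal-length lists (ignores ties)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# hypothetical human ratings and model cosine similarities for
# word pairs such as (king, queen), (sword, dagger), ...
human = [9.1, 7.5, 3.2, 1.0]
model = [0.82, 0.66, 0.30, 0.05]
print(spearman(human, model))  # 1.0 here: the model ranks the pairs like humans do
```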
Extrinsic evaluation consists of evaluating the model’s performance regarding a downstream task such as text classification, machine translation, or summarization. The downstream task has its own performance evaluation strategy, which gives you insight into the embedding model’s accuracy.
In short, no embedding is inherently better than another. The performance of your NLP task (classification, translation, etc.) is what matters in the end.
Let’s Recap!
We trained a word2vec model from scratch with Gensim on a Shakespeare corpus.
Finally, you learned that you can evaluate embedding models intrinsically or extrinsically.
Intrinsic: The embedding is compared to a reference model (i.e., a lexical database).
Extrinsic: The model is evaluated with an NLP task such as classification, machine translation, or summarization.
That’s a Wrap!
We are at the end of the course! You’ve learned how to preprocess text and transform it into vectors using bag-of-words and word embeddings. You also applied text vectorization to text classification, sentiment analysis, and unsupervised exploration. It was great fun to write this course, and I hope you enjoyed it!
Natural language processing is innovating at breathtaking speed. Its applications have a very significant impact on our lives. With NLP, we’re reaching out for the infinite diversity, elegance, and power of the human language. I hope this course gave you a taste of NLP and has motivated you to continue learning more. And yes, there is plenty more to learn, so stay curious, keep practicing, and keep learning! 😀
I am grateful to the OpenClassrooms team, whose ideas and support made a huge difference. 🧙Many thanks, folks. 🙂