Set Up Your Environment
To train your first model, we’ll use the Shakespeare corpus, composed of all the lines from the Shakespeare plays available on Kaggle. The idea behind working on classic literature is not to be snobbish, but to find a corpus different enough from the ones word2vec and GloVe were trained on (Google News and Wikipedia). We expect the Shakespeare dataset to have a different worldview and vocabulary. The dataset is also large and already in a short-sequence format, which speeds up training.
Load the dataset using the following code:
import urllib.request
import re
# change to your own path if you have downloaded the file locally
url = 'https://raw.githubusercontent.com/alexisperrier/intro2nlp/master/data/Shakespeare_alllines.txt'
# read file into list of lines
lines = urllib.request.urlopen(url).read().decode('utf-8').split("\n")
Remove all punctuation and tokenize with the following:
sentences = []
for line in lines:
    # remove punctuation
    line = re.sub(r'[\!"#$%&\*+,-./:;<=>?@^_`()|~=]', '', line).strip()
    # simple tokenizer
    tokens = re.findall(r'\b\w+\b', line)
    # only keep lines with more than one token
    if len(tokens) > 1:
        sentences.append(tokens)
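To sanity-check the preprocessing, you can run the same two regexes on a single line. The sample sentence below is just an illustration:

```python
import re

line = 'To be, or not to be: that is the question!'
# same punctuation-stripping pattern as above
line = re.sub(r'[\!"#$%&\*+,-./:;<=>?@^_`()|~=]', '', line).strip()
# same simple word tokenizer as above
tokens = re.findall(r'\b\w+\b', line)
print(tokens)
# ['To', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
```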
Train a word2vec Model
Let’s start by training a word2vec model with Gensim using the following parameters:
- min_count: ignores words that appear fewer times than this threshold.
- vector_size: the dimensionality of the embeddings. Let’s choose 50.
- window: the size of the context window around each target word. We’ll use a window size of 7.
- sg: another important parameter; it determines whether to use CBOW (sg=0) or skip-grams (sg=1) as the training strategy. We’ll use skip-grams (sg=1).
We’ll call our model bard2vec. Get it? Shakespeare? The Bard? Okay, moving on.
from gensim.models import Word2Vec

bard2vec = Word2Vec(
    sentences,
    min_count=3,     # ignore words that appear fewer than 3 times
    vector_size=50,  # dimensionality of the word embeddings
    sg=1,            # use skip-grams
    window=7,        # context window around each target word
    epochs=40)       # number of training epochs over the corpus
The training is pretty fast. We can explore our new model by looking at some similar words. Here are a few examples that give some insight into a Shakespearian view of the world: King, sword, husband, and Hamlet, of course.
At this point, feel free to experiment with the parameters of the word2vec model and check other words. Use bard2vec.wv.most_similar(word) to get the list of similar words:
| most_similar('King') | most_similar('sword') | most_similar('husband') | most_similar('Hamlet') |
|---|---|---|---|
| Henry | scimitar | wife | cousin |
| Pepin | head | mistress | chuck |
| Richard | knife | son | gaoler |
| Edward | dagger | mother | Gertrude |
| England | rapier | daughter | Mercutio |
| Pericles | hand | master | sentence |
| Leontes | sleeve | father | Fenton |
| whereas | scabbard | brother | Escalus |
| Fifth | burgonet | Katharina | Stanley |
| hearse | Bringing | puppy | Advancing |
As you can see, you end up with all things relevant to Shakespeare’s plays and era. If you train the model with different parameters, you will get different results.
For instance, if you use the following parameters for your model, the word similarity results will be quite different:
from gensim.models import Word2Vec

bard2vec = Word2Vec(
    sentences,
    min_count=3,     # same
    vector_size=50,  # same
    sg=0,            # CBOW instead of skip-grams
    window=10,       # larger context window
    epochs=100)      # longer training
| most_similar('King') | most_similar('sword') | most_similar('husband') | most_similar('Hamlet') |
|---|---|---|---|
| title | head | wife | Canterbury |
| vial | rapier | mistress | Northumberland |
| Gaunt | weapon | mother | Clifford |
| Edward | knife | son | Fortinbras |
| Burgundy | finger | father | Gloucester |
| Queen | heart | sister | Gertrude |
| king | dagger | brother | Margaret |
| Arthur | horse | daughter | York |
| Scotland | face | master | Goneril |
| corse | foot | friend | Cressid |
Evaluate Your Embedding
There are two ways to evaluate embeddings: intrinsic and extrinsic.
Intrinsic evaluation involves comparing the embedding to a reference model. For instance, you can compare word similarities to an existing lexical database, such as WordNet. You can also manually annotate the embedding results, but this takes time and human resources.
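As a concrete toy illustration of intrinsic evaluation, you can rank-correlate the model’s similarity scores with human similarity ratings for the same word pairs. The ratings and scores below are made up for illustration; real benchmarks like WordSim-353 work the same way:

```python
def spearman(xs, ys):
    """Spearman rank correlation between two equal-length lists (ignores ties)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# hypothetical human ratings and model cosine similarities for
# word pairs such as (king, queen), (sword, dagger), ...
human = [9.1, 7.5, 3.2, 1.0]
model = [0.82, 0.66, 0.30, 0.05]
print(spearman(human, model))  # 1.0 here: the model ranks the pairs like humans do
```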
Extrinsic evaluation consists of evaluating the model’s performance regarding a downstream task such as text classification, machine translation, or summarization. The downstream task has its own performance evaluation strategy, which gives you insight into the embedding model’s accuracy.
In short, no embedding is inherently better than another. The performance of your NLP task (classification, translation, etc.) is what matters in the end.
Let’s Recap!
We trained a word2vec model from scratch with Gensim on a Shakespeare corpus.
Finally, you learned that you can evaluate embedding models intrinsically or extrinsically.
Intrinsic: The embedding is compared to a reference model (i.e., a lexical database).
Extrinsic: The model is evaluated with an NLP task such as classification, machine translation, or summarization.
That’s a Wrap!
We are at the end of the course! You’ve learned how to preprocess text and transform it into vectors using bag-of-words and word embeddings. You also applied text vectorization to text classification, sentiment analysis, and unsupervised exploration. It was great fun to write this course, and I hope you enjoyed it!
Natural language processing is innovating at breathtaking speed. Its applications have a very significant impact on our lives. With NLP, we’re reaching out for the infinite diversity, elegance, and power of the human language. I hope this course gave you a taste of NLP and has motivated you to continue learning more. And yes, there is plenty more to learn, so stay curious, keep practicing, and keep learning! 😀
I am grateful to the OpenClassrooms team, whose ideas and support made a huge difference. 🧙Many thanks, folks. 🙂