

Last updated on 3/4/22

Create a Unique Word Form With SpaCy


The Issue With Multiple Word Forms

In the previous chapters, we tokenized the Earth text and removed stop words from the original text, which improved the word cloud visualization. We then looked deeper into the notion of tokenization and explored the NLTK library to preprocess text data.

But there's more!

Words can take multiple forms in a text.

For instance, the verb to be is usually found conjugated throughout the text with forms such as is, am, are, was, etc. These word forms end up being counted as separate words, although they all relate to the verb's infinitive.
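To make the counting problem concrete, here is a minimal sketch in plain Python. The sentence and the TO_BE mapping are made up for illustration; a real pipeline would rely on a stemmer or a lemmatizer rather than a hand-written table.

```python
from collections import Counter

# Made-up example sentence containing four forms of the verb "to be".
text = "I am here. She is here. They are here. He was here."
tokens = [word.strip(".").lower() for word in text.split()]

# Without any normalization, each conjugated form is counted on its own.
raw_counts = Counter(tokens)

# Hypothetical hand-written mapping from each form to the infinitive.
TO_BE = {"am": "be", "is": "be", "are": "be", "was": "be"}
merged_counts = Counter(TO_BE.get(token, token) for token in tokens)

print(raw_counts)     # "am", "is", "are", "was" each appear once
print(merged_counts)  # the four forms are now counted together as "be"
```

Once the forms are merged, the verb shows up as one frequent word instead of four rare ones.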

In addition to conjugations, other word forms include:

  • Singulars and plurals: language and languages, word and words, etc.

  • Gerunds (present participles): giving, parsing, learning, etc.

  • Adverbs, most often ending in -ly: bad and badly, rough and roughly.

  • Past participles: given, taken, etc.

A root word with different possible endings generates multiple word forms.

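But why do you need a single word form for each meaningful word in the text? Because every variant is otherwise counted as a separate word, which inflates the vocabulary and splits the counts of what is really one word.

You can reduce a word's variants to a unique form with two different methods: stemming or lemmatization.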

Stem Words (Remove the Suffix of a Word)

Stemming is the process of removing the suffix of a word based on the assumption that different word forms (e.g., lightning, lightly, lighting) consist of a root word (light) and an ending (+ning, +ly, +ing).

Although words may contain both prefixes and suffixes, stemming only removes suffixes. And it does so rather brutally!

Let's look at a couple of examples with the words change and study. You only keep the roots: "chang" and "studi," and drop the endings for every variation.

Examples of stemming

Stemming does not care if the root is a real word or not (e.g., studi), which sometimes makes it difficult to interpret the results of an NLP task.
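The change and study examples can be reproduced with a short sketch that uses NLTK's PorterStemmer. The word lists are my own picks for illustration, and the snippet assumes NLTK is installed (the Porter stemmer itself needs no extra data download).

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Illustrative word families; any other variants could be used instead.
families = {
    "change": ["change", "changes", "changed", "changing"],
    "study": ["study", "studies", "studied", "studying"],
}

# Stem every variant of each family.
stems = {name: [ps.stem(word) for word in words] for name, words in families.items()}

for name, words in families.items():
    print(name, "->", dict(zip(words, stems[name])))
```

Every variant of change ends up as chang, and every variant of study as studi, neither of which is an English word.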

Let's see how to apply stemming to Wikipedia's Earth page. First, tokenize the text, and for each token, extract the stem of the word.

from nltk.tokenize import WordPunctTokenizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
# Get the text, for instance from Wikipedia.
# See chapter 1 for the wikipedia_page function.
text = wikipedia_page('Earth').lower()
# Build the stop word set once so that each lookup is fast.
stop_words = set(stopwords.words('english'))
# Tokenize and remove stop words.
tokens = WordPunctTokenizer().tokenize(text)
tokens = [tk for tk in tokens if tk not in stop_words]
# Instantiate a stemmer
ps = PorterStemmer()
# and stem each token.
stems = [ps.stem(tk) for tk in tokens]

Let's inspect what kind of stems were generated by picking a random sample from the list:

import numpy as np
np.random.choice(stems, size = 10)

Since the sample is random, your results will differ from mine. Here are three samples I obtained:

> ['subtrop', 'dure', 'electr', 'eurasian', 'univers', 'cover', 'between','from', 'that', 'earth']
> ['in', 'interior', 'roughli', 'holocen', 'veloc', 'impact', 'in', 'point', 'the', 'come']
> ['caus', 'proxim', 'migrat', 'lithospher', 'as', 'on', 'are', 'earth', 'low', 'also']

So among whole words such as earth, low, or point, you also have truncated words: subtrop, electr, roughli, caus, proxim.

As you can see from that example, stemming is a one-way process. It isn't easy to recover the original word from a stem such as electr. Was it electronic, electrical, electricity, or electrons? Is the stem univers related to universities or the universe? It's impossible to tell.
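One way to see this loss of information is to group a few words by their stem. The sketch below again uses NLTK's PorterStemmer; the word list is an assumption chosen to match the univers and electr stems discussed here.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Illustrative words; several of them collapse onto the same stem.
words = ["universe", "universities", "university", "electricity", "electrical"]

# Reverse index: stem -> list of original words that produced it.
by_stem = defaultdict(list)
for word in words:
    by_stem[ps.stem(word)].append(word)

for stem, originals in by_stem.items():
    print(f"{stem}: {originals}")
```

Given only the stem, there is no way to tell which of the grouped words was in the original text.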

Stemming is a bit crude, and you want more than just the rough root of the word. For that, you use lemmatization.

Lemmatize Words (Reduce Words to a Canonical Form)

The lemma is the word form you would find in a dictionary. The word universities is found under university, while universe is found under universe—no room for misinterpretation. A lemma is also called the canonical form of a word.

Lemmatization of the words studying, studies, and study

A lemmatizer not only finds the most appropriate and essential version of a word; it also looks at the grammatical role in the sentence to find its canonical form.

Lemmatization based on a whole sentence

In this last example, the word meeting is lemmatized as meeting when it is a noun and as meet when it is a verb. In both examples, the words was and am were lemmatized into be.
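The idea that the lemma depends on the grammatical role can be sketched with a toy lookup table keyed by word and part of speech. Everything here is hypothetical: the LEMMAS table, the lemmatize helper, and the "NOUN"/"VERB" labels are made up for illustration, whereas a real lemmatizer infers the part of speech from the sentence and relies on a full dictionary.

```python
# Toy lemma table keyed by (word, part of speech). Hypothetical and tiny:
# it only covers the words discussed in this section.
LEMMAS = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("was", "VERB"): "be",
    ("am", "VERB"): "be",
}


def lemmatize(word, pos):
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    return LEMMAS.get((word.lower(), pos), word.lower())


print(lemmatize("meeting", "NOUN"))  # the noun keeps its form
print(lemmatize("meeting", "VERB"))  # the verb is reduced to its infinitive
print(lemmatize("Was", "VERB"))
```

The same surface form, meeting, maps to two different lemmas, which is something a stemmer cannot do because it never looks at the context.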

NLTK has a lemmatizer based on WordNet, an extensive lexical database for the English language. You can access that database on the nltk.stem package documentation page. However, my lemmatizer of choice is the one from the spaCy library (spacy.io).

Tokenize and Lemmatize With SpaCy

The spaCy library (spacy.io) is a must-have for all NLP practitioners. The library covers several low-level tasks such as tokenization, lemmatization, and part-of-speech (POS) tagging. It also offers named entity recognition (NER) and embeddings for dozens of languages.

You can install spaCy with conda or pip. Once you install the library, you need to download the model that fits your needs. The spaCy models are language-dependent and vary in size. Follow the instructions on the install page and download the small English model en_core_web_sm.

Using spaCy on a text involves three steps in Python:

  1. Import the library: import spacy.

  2. Load the model: nlp = spacy.load("en_core_web_sm").

  3. Apply the model to the text: doc = nlp("This is a sentence.").

The nlp model is the source of the magic. The doc object contains the information inferred by spaCy using the nlp model. It is also an iterable object over which you can loop, token by token.

When applying the nlp model on a text, spaCy carries out its own parsing and text analysis.

Tokenize With SpaCy

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Roads? Where we’re going we don’t need roads!")
# Loop over the doc object to access each token
for token in doc:
    print(token)

This generates the following list of tokens:

[Roads, ?, Where, we, ’re, going, we, do, n’t, need, roads, !]

You can see that the tokenization properly handled the punctuation signs: ? and !. But there's plenty more to look at!

Each element of the doc object holds information on the nature and style of the token:

  • is_space: whether the token is whitespace.

  • is_punct: whether the token is a punctuation mark.

  • is_upper: whether the token is all uppercase.

  • is_digit: whether the token consists only of digits.

  • is_stop: whether the token is a stop word.

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("All aboard! \t Train NXH123 departs from platform 22 at 3:16 sharp.")
# Print each token followed by its flags
for token in doc:
    print(token, token.is_space, token.is_punct, token.is_upper, token.is_digit)

This gives the following output:

token     space?    punct?    upper?    digit?
All       False     False     False     False
aboard    False     False     False     False
!         False     True      False     False
<tab>     True      False     False     False
Train     False     False     False     False
NXH123    False     False     True      False
departs   False     False     False     False
from      False     False     False     False
platform  False     False     False     False
22        False     False     False     True
at        False     False     False     False
3:16      False     False     False     False
sharp     False     False     False     False
.         False     True      False     False

Note that:

  • The tab element (\t) in the sentence has been flagged by the is_space attribute.

  • NXH123 has been flagged as being all uppercase characters by the is_upper attribute.

  • The number and the punctuation marks are also properly flagged.
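These flags are handy for filtering tokens. The sketch below defines a hypothetical helper, content_tokens, that keeps only tokens that are not whitespace, punctuation, or stop words. It assumes spaCy is installed and uses spacy.blank("en"), a tokenizer-only English pipeline that needs no model download; the try/except simply lets the helper be defined even when spaCy is missing.

```python
def content_tokens(doc):
    """Return the text of tokens that are not whitespace, punctuation, or stop words."""
    return [
        token.text
        for token in doc
        if not (token.is_space or token.is_punct or token.is_stop)
    ]


try:
    import spacy

    # A blank English pipeline only tokenizes, which is enough for these flags.
    nlp = spacy.blank("en")
    doc = nlp("All aboard! \t Train NXH123 departs from platform 22 at 3:16 sharp.")
    print(content_tokens(doc))
except ImportError:
    print("spaCy is not installed, so the demo was skipped.")
```

The same helper works on a doc produced by en_core_web_sm, since the flags are available on every spaCy token.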

Lemmatize With SpaCy

You can also handle lemmatization with spaCy by using token.lemma_.

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("I came in and met with her teammates at the meeting.")
# Print each token next to its lemma
for token in doc:
    print(f"{token.text}\t {token.lemma_} ")

This gives:

token       lemma
I           -PRON-
came        come
in          in
and         and
met         meet
with        with
her         -PRON-
teammates   teammate
at          at
the         the
meeting     meeting
.           .

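You can see that spaCy properly lemmatized the verbs and the plurals. It didn't provide a lemma for I and her, which in this case would have been the words themselves. Instead, it replaced each of them with the placeholder "-PRON-", which marks a pronoun. Note that this placeholder is specific to spaCy 2; spaCy 3 and later return an ordinary lemma for pronouns.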
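Lemmas become useful once you count them, since all the forms of a word then add up under a single entry. Here is a minimal sketch with a hypothetical helper, lemma_counts. The example sentence is made up, and the demo assumes spaCy and the en_core_web_sm model are installed; it is skipped otherwise.

```python
from collections import Counter


def lemma_counts(doc):
    """Count lowercased lemmas, skipping whitespace and punctuation tokens."""
    return Counter(
        token.lemma_.lower()
        for token in doc
        if not (token.is_space or token.is_punct)
    )


try:
    import spacy

    # Assumes the small English model has been downloaded.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("She meets them today. They met yesterday. We are meeting tomorrow.")
    print(lemma_counts(doc).most_common(3))
except (ImportError, OSError):
    # spacy.load raises OSError when the model cannot be found.
    print("spaCy or en_core_web_sm is missing, so the demo was skipped.")
```

The exact lemmas depend on the spaCy version and model, so the counts may vary slightly from one installation to another.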

Let's Recap!

  • Words come in many forms, and you need to reduce the overall vocabulary size by finding a common form for all the variations of a word.

  • Stemming drops the end of the word to retain a stable root. It is fast, but sometimes the results are difficult to interpret.

  • Lemmatization is smarter: it returns the dictionary form of a word and takes into account its grammatical role in the sentence.

  • Use spaCy to work with language-dependent models of various sizes and complexity.

  • spaCy handles tokenization out of the box and offers:

    • Token analysis: punctuation, lowercase, stop words, etc.

    • Lemmatization.

    • And much more! 

  • You can find the code of this chapter in this Jupyter notebook.

In the next chapter, we will continue with information extraction. You will learn how to identify certain text elements such as emails, hashtags, or URLs based on their inherent patterns.
