
Last updated on 1/28/21

Bonus! Doing More with SpaCy

One more chapter? Yes! Everyone loves bonus information, right?

We’ve already taken a deep dive into using spaCy for lemmatization and word embeddings. But spaCy does so much more, and we would be remiss not to share it with you! The following functionalities are powerful and may come in handy one day!

Part-of-Speech Tagging

You can use spaCy for part-of-speech tagging right out of the box with token.pos_. Let's apply it to a quote from Alice in Wonderland:

If you don’t know where you are going any road can take you there.

Alice in Wonderland - Cheshire Cat

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("If you don’t know where you are going any road can take you there.")
for token in doc:
    print(f"{token.text}\t {token.pos_}")

This results in:

If     SCONJ
you    PRON
do     AUX
n’t    PART
know   VERB
where  ADV
you    PRON
are    AUX
going  VERB
any    DET
road   NOUN
can    VERB
take   VERB
you    PRON
there  ADV
.      PUNCT

There are verbs, pronouns, punctuation marks, adverbs, auxiliary verbs, determiners, and even a subordinating conjunction (the If that opens the conditional).

I still remember my middle school English teacher quoting this line from Richard II (Act 2, Scene 3) to illustrate the polymorphic nature of English words. In this line, the words grace and uncle are each used both as nouns (as expected) and as verbs.

Grace me no grace, nor uncle me no uncle

This is a distinction that spaCy finds without issue. This code:

doc = nlp("Grace me no grace, nor uncle me no uncle")
for t in doc:
    print(t, t.pos_)

Gives:

Grace VERB
...
grace NOUN
...
uncle VERB
...
uncle NOUN

However, the NLTK universal tagger fails!

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
text = nltk.word_tokenize("Grace me no grace, nor uncle me no uncle")
nltk.pos_tag(text, tagset='universal')
> [('Grace', 'NOUN'), ..., ('grace', 'NOUN'), ..., ('uncle', 'ADP'), ..., ('uncle', 'NOUN')]
# ADP here is an adposition (it's complicated)

Applications

Part-of-speech tagging can be applied to sentiment analysis, named entity recognition, and word-sense disambiguation.
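For instance, a common first step in those applications is to filter for content words. The sketch below reuses the tagged output from the Cheshire Cat quote (hard-coded here so it runs without loading spaCy); keeping only nouns, verbs, adjectives, and adverbs is typical preprocessing before sentiment analysis or word-sense disambiguation.

```python
# POS tags for the Cheshire Cat quote, as produced by spaCy above
tagged = [("If", "SCONJ"), ("you", "PRON"), ("do", "AUX"), ("n't", "PART"),
          ("know", "VERB"), ("where", "ADV"), ("you", "PRON"), ("are", "AUX"),
          ("going", "VERB"), ("any", "DET"), ("road", "NOUN"), ("can", "VERB"),
          ("take", "VERB"), ("you", "PRON"), ("there", "ADV"), (".", "PUNCT")]

# Keep only the content words: nouns, verbs, adjectives, and adverbs
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}
content_words = [word for word, pos in tagged if pos in CONTENT_POS]
print(content_words)
# ['know', 'where', 'going', 'road', 'can', 'take', 'there']
```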

Grammatical information is interesting but what if you could extract names, locations, nationalities or company names from a text?

That's where named-entity recognition (NER) comes in. 

Named-Entity Recognition

Named-entity recognition is the task of identifying real-world objects (i.e., anything that can be denoted with a proper name) and classifying them into pre-defined categories.

The spaCy NER models can identify:

  • PERSON: people, real or fictional.

  • LOC: locations.

  • ORG: organizations such as companies, agencies, and institutions.

  • GPE: countries, cities, states.

They can also identify:

  • Date and time.

  • Percentages.

  • Money, events, works of art, and languages.
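If you ever forget what a label stands for, spacy.explain returns a short description of any tag or entity label, with no model load required:

```python
import spacy

# spacy.explain maps a POS tag or entity label to a short description
for label in ("PERSON", "ORG", "GPE", "SCONJ"):
    print(label, "->", spacy.explain(label))
```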

When applying a model to a text (with doc = nlp(text)), spaCy runs the NER model previously loaded (with nlp = spacy.load(model)) to find all the entities in the text. The information is made available in the iterable doc.ents.

Application

Let's use the NER model to find the most frequent characters from Alice in Wonderland. You would expect Alice, the Queen, and of course the Rabbit to be the most frequent, right?

You can load the text directly from Project Gutenberg with:

import requests
import spacy
from collections import Counter
nlp = spacy.load("en_core_web_sm")
r = requests.get('http://www.gutenberg.org/files/11/11-0.txt')
doc = nlp(r.text.split("*** END")[0])
# collect all the entities that are tagged PERSON
persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
# and list the 12 most common ones
Counter(persons).most_common(12)

This gives:

[('Alice', 311),
 ('Gryphon', 52),
 ('Queen', 43),
 ('Duchess', 27),
 ('Hatter', 17),
 ('the Mock Turtle', 16),
 ('Cat', 15),
 ('Mouse', 13),
 ('Dinah', 10),
 ('Bill', 8),
 ('Majesty', 8),
 ('Rabbit', 7)]

Although the Rabbit is a major character in the book, it only comes up seven times as a person. Maybe spaCy identified the Rabbit as some other type of entity? Let's find out.

This code:

rabbit_ner = [(ent.text, ent.label_) for ent in doc.ents if "Rabbit" in ent.text]
Counter(rabbit_ner).most_common(10)

Returns:

[(('the White Rabbit', 'ORG'), 11),
 (('Rabbit', 'PERSON'), 7),
 (('Rabbit', 'PRODUCT'), 3),
 (('White Rabbit', 'ORG'), 3),
 (('The Rabbit Sends', 'PRODUCT'), 1),
 (('Rabbit', 'EVENT'), 1),
 (('Rabbit', 'FAC'), 1),
 (('Rabbit', 'ORG'), 1),
 (('the White\r\nRabbit', 'ORG'), 1),
 (('The White Rabbit', 'ORG'), 1)]

Wow! Not what you expected, eh?

This example shows that even on a classic, well-formatted, clean text, spaCy struggles to correctly identify straightforward entities, mostly because of the nature, diversity, and volume of the data used to train the model in the first place.
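One pragmatic fix is light post-processing before counting. The sketch below hard-codes the (entity, count) pairs from the output above, then collapses line breaks and strips leading articles so that variants like "the White Rabbit", "The White Rabbit", and "the White\r\nRabbit" merge into one entry:

```python
import re
from collections import Counter

# (entity text, count) pairs taken from the spaCy output above
rabbit_counts = [
    ("the White Rabbit", 11), ("Rabbit", 7), ("Rabbit", 3),
    ("White Rabbit", 3), ("The Rabbit Sends", 1), ("Rabbit", 1),
    ("Rabbit", 1), ("Rabbit", 1), ("the White\r\nRabbit", 1),
    ("The White Rabbit", 1),
]

def normalize(entity: str) -> str:
    # Collapse line breaks and runs of whitespace into single spaces
    entity = re.sub(r"\s+", " ", entity)
    # Strip a leading article so "the White Rabbit" and "White Rabbit" merge
    return re.sub(r"^[Tt]he\s+", "", entity)

merged = Counter()
for text, count in rabbit_counts:
    merged[normalize(text)] += count

print(merged.most_common(3))
# [('White Rabbit', 16), ('Rabbit', 13), ('Rabbit Sends', 1)]
```

With this normalization, the White Rabbit jumps from seven mentions to a much more plausible sixteen.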

The small en_core_web_sm model (11 MB) that we loaded to parse Alice in Wonderland was trained on OntoNotes, a corpus of 1,745k web-based English texts. That could be why! The larger en_core_web_lg model (782 MB) is trained on OntoNotes plus a subset of the staggering Common Crawl data, so you'd expect better NER results from it!

Let's Recap!

  • Part-of-speech (POS) tagging is the task of finding the grammatical nature of the words in a sentence: nouns, verbs, adjectives, etc.

  • Named-entity recognition (NER) is the task of identifying persons, places and organizations in a text.

  • Use spaCy to do POS tagging and NER right out of the box.

  • POS is a key component in NLP applications such as sentiment analysis, named-entity recognition, and word sense disambiguation.

  • As shown in Alice in Wonderland, NER is not as straightforward as POS tagging and may require extra processing to consolidate the entities it finds.

Okay, now it's really done. Good luck on your future NLP adventures! 
