If you apply a spaCy model to a text, you get the tokenization, lemmatization, and morphological flags (space, uppercase, digit, etc.) for each token.
During that operation, spaCy also performs two essential tasks:
Part-of-speech tagging (POS): finding the grammatical nature of each word in a sentence: noun, adjective, preposition, verb, etc.
Named entity recognition (NER): the ability to automatically identify and extract key items or entities from a text (names of persons, companies, products, or locations, but also days, medical terms, or quantities).
POS and NER are essential tasks in NLP. POS is used for information extraction (finding all the adjectives associated with a person or a product, for example) and facilitates language understanding for complex NLP tasks (text generation, for instance). NER is used across many domains to identify specific entities from the text (medical terms, legal concepts, people, etc.).
Identify the Nature of a Word With Part-of-Speech Tagging
Part-of-speech tagging (POS) identifies the grammatical nature of each word in a sentence, as the example in the previous chapter illustrated. For instance, in the sentence “I met him during the meeting,” the root meet appears both as a verb (met) and as a noun (meeting).
How can we use POS?
For instance, extracting the adjectives related to a particular entity (a person) in online social content would give you a good understanding of people’s opinions of that person.
POS is also used for word-sense disambiguation.
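To make the first idea concrete, here is a minimal sketch that pairs POS tags with NER (covered later in this chapter). It uses a deliberately naive heuristic, collecting the adjectives that occur in the same sentence as a PERSON entity, on a made-up sample text; the doc = nlp(text) pattern it relies on is explained just below.

import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alice was curious and brave. The old Duchess was very ugly.")

# naive heuristic: describe each PERSON entity with the adjectives
# found in the same sentence
opinions = defaultdict(list)
for sent in doc.sents:
    people = [ent.text for ent in sent.ents if ent.label_ == "PERSON"]
    adjectives = [token.text for token in sent if token.pos_ == "ADJ"]
    for person in people:
        opinions[person].extend(adjectives)

print(dict(opinions))
# e.g. {'Alice': ['curious', 'brave'], ...} -- the exact output depends
# on the model's predictions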
POS tagging with spaCy works just like lemmatization: apply a spaCy model to the text with doc = nlp(text), loop over each token, and print out the token attribute pos_.
Let’s try an example with that quote from Alice in Wonderland:
If you don’t know where you are going, any road can take you there.
Alice in Wonderland - Cheshire Cat
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("If you don't know where you are going any road can take you there.")
# print the nature of each token
for token in doc:
    print(f"{token.text}\t{token.pos_}")
This results in:
Token	POS
If	SCONJ
you	PRON
do	AUX
n't	PART
know	VERB
where	SCONJ
you	PRON
are	AUX
going	VERB
any	DET
road	NOUN
can	AUX
take	VERB
you	PRON
there	ADV
.	PUNCT
There are verbs, pronouns, punctuation marks, adverbs, auxiliaries, and determiners, plus the conditional if, tagged as a subordinating conjunction (SCONJ).
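If a tag label is unfamiliar, spacy.explain (a helper built into spaCy) returns a short, human-readable description:

import spacy

print(spacy.explain("SCONJ"))  # subordinating conjunction
print(spacy.explain("AUX"))    # auxiliary
print(spacy.explain("DET"))    # determiner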
I remember my middle school English teacher citing this line from Shakespeare’s Richard II, Act 2, Scene 3, to illustrate the polymorphic nature of English words. In it, grace and uncle are each used both as a noun (as expected) and as a verb.
Grace me no grace, nor uncle me no uncle
spaCy doesn’t have an issue finding this distinction. This code:
doc = nlp("Grace me no grace, nor uncle me no uncle")
for t in doc: print(t, t.pos_)
Results in:
Grace VERB ... grace NOUN ... uncle VERB ... uncle NOUN
On the other hand, the NLTK universal tagger fails to correctly parse the sentence. spaCy 1, NLTK 0!
import nltk
# download the tokenizer and tagger resources (needed once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
text = nltk.word_tokenize("Grace me no grace, nor uncle me no uncle")
nltk.pos_tag(text, tagset='universal')
Returns:
> [('Grace', 'NOUN'), ..., ('grace', 'NOUN'), ..., ('uncle', 'ADP'), ..., ('uncle', 'NOUN')] # ADP here is an Adposition (it's complicated)
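As an aside, spaCy exposes two levels of tags on each token: the coarse universal tag in pos_ used so far, and a fine-grained, treebank-specific tag in tag_ (Penn Treebank style for the English models). Printing both can help when a coarse tag looks surprising:

# reuse the spaCy pipeline loaded earlier
doc = nlp("Grace me no grace, nor uncle me no uncle")
for t in doc:
    # pos_ is the coarse universal tag, tag_ the fine-grained one
    print(t.text, t.pos_, t.tag_)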
Grammar is great, but what if you could extract names, locations, nationalities, or company names from a text?
That’s where named entity recognition comes in.
Extract Real-World Objects With Named Entity Recognition
Named entity recognition (NER) identifies real-world objects (i.e., anything that can be denoted with a proper name) and classifies them into predefined categories.
Out of the box, spaCy can identify the following:
PERSON: people, real or fictional.
LOC: non-GPE locations, such as mountain ranges and bodies of water.
ORG: organizations such as companies, agencies, and institutions.
GPE: geopolitical entities: countries, cities, states.
It can also identify:
Date and time.
Percentages.
Money, events, works of art, and languages.
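Here again, spacy.explain gives a quick description of any entity label you don't recognize:

import spacy

print(spacy.explain("GPE"))    # Countries, cities, states
print(spacy.explain("LOC"))    # Non-GPE locations, mountain ranges, bodies of water
print(spacy.explain("MONEY"))  # Monetary values, including unit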
When you apply a model to a text (with doc = nlp(text)), spaCy runs the NER model loaded earlier (with nlp = spacy.load(model)) to find all the entities in the text. The results are made available in the iterable doc.ents.
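Before tackling an entire book, here is the doc.ents pattern on a single made-up sentence (the exact labels depend on the model's predictions):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ada Lovelace published the first program in London in 1843.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# typically: Ada Lovelace PERSON, London GPE, 1843 DATE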
For example, let’s use the NER model to find the most frequent characters from Alice in Wonderland. You would expect Alice, the Queen, and of course the Rabbit to be the most frequent, right?
You can load the text directly from Project Gutenberg with:
import requests
import spacy
from collections import Counter
nlp = spacy.load("en_core_web_sm")
# text from Alice in Wonderland
r = requests.get('http://www.gutenberg.org/files/11/11-0.txt')
# remove the footer and parse the text
doc = nlp(r.text.split("*** END")[0])
# Find all the 'persons' in the text
persons = []
# for each entity in the doc
for ent in doc.ents:
    # if the entity is a person
    if ent.label_ == 'PERSON':
        # add it to the list of persons
        persons.append(ent.text)
# note: the loop above can be written in one line with
persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
# list the 12 most common ones
Counter(persons).most_common(12)
This gives:
[('Alice', 311), ('Gryphon', 52), ('Queen', 43), ('Duchess', 27), ('Hatter', 17), ('the Mock Turtle', 16), ('Cat', 15), ('Mouse', 13), ('Dinah', 10), ('Bill', 8), ('Majesty', 8), ('Rabbit', 7)]
Although the Rabbit is a major character in the book, he only comes up seven times as a person. Maybe spaCy identified the Rabbit as some other type of entity? Let’s find out.
This code:
rabbit_ner = [(ent.text, ent.label_) for ent in doc.ents if "Rabbit" in ent.text]
Counter(rabbit_ner).most_common(10)
Returns:
[(('the White Rabbit', 'ORG'), 11), (('Rabbit', 'PERSON'), 7), (('Rabbit', 'PRODUCT'), 3), (('White Rabbit', 'ORG'), 3), (('The Rabbit Sends', 'PRODUCT'), 1), (('Rabbit', 'EVENT'), 1), (('Rabbit', 'FAC'), 1), (('Rabbit', 'ORG'), 1), (('the White\r\nRabbit', 'ORG'), 1), (('The White Rabbit', 'ORG'), 1)]
Wow! Not what you expected, eh?
This example shows that even on a classic, well-formatted and clean text, spaCy struggles to identify straightforward entities correctly. This is mostly a consequence of the model we used for the NER task.
The small en_core_web_sm model (11 MB) we loaded to parse Alice in Wonderland is trained on OntoNotes, a corpus of 1,745k web-based English texts. In contrast, the larger en_core_web_lg model (782 MB) is trained on OntoNotes plus a subset of the staggering Common Crawl data. You’d expect better NER results using that larger model!
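Here's a sketch of how you could rerun the Rabbit check with that larger model, assuming you have downloaded it first (whether it actually fixes the labels is left for you to verify):

# one-time download, from the command line:
#   python -m spacy download en_core_web_lg
import spacy
from collections import Counter

nlp_lg = spacy.load("en_core_web_lg")
# reuse the Gutenberg text fetched earlier with requests
doc_lg = nlp_lg(r.text.split("*** END")[0])
rabbit_ner = [(ent.text, ent.label_) for ent in doc_lg.ents if "Rabbit" in ent.text]
print(Counter(rabbit_ner).most_common(10))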
Let’s Recap!
Part-of-speech (POS) tagging is the task of finding the grammatical nature of the words in a sentence: nouns, verbs, adjectives, etc.
Named entity recognition (NER) identifies persons, places, and organizations in a text.
Use spaCy to do POS and NER right out of the box.
POS is a key component in NLP applications such as word sense disambiguation.
As you saw with Alice in Wonderland, NER is not as straightforward as POS; getting reliable entities may require a larger model or extra processing.
This concludes Part I of the course. So far, you’ve sliced and diced text into tokens to simplify it. In the next part, you will transform text into numbers to make it more computer friendly. This process is called vectorization.