One more chapter? Yes! Everyone loves bonus information, right?
We’ve already taken a deep dive into using spaCy for lemmatization and word embeddings. But spaCy does so much more, and it would be remiss of us not to share it with you! The following functionalities are powerful and may come in handy one day!
Part-of-Speech Tagging
You can use spaCy for part-of-speech tagging right out of the box with token.pos_. Let's apply that to a quote from Alice in Wonderland:
If you don’t know where you are going any road can take you there.
Alice in Wonderland - Cheshire Cat
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("If you don’t know where you are going any road can take you there.")
for token in doc:
    print(f"{token.text}\t{token.pos_}")
This results in:
If SCONJ
you PRON
do AUX
n’t PART
know VERB
where ADV
you PRON
are AUX
going VERB
any DET
road NOUN
can VERB
take VERB
you PRON
there ADV
. PUNCT
There are verbs, pronouns, punctuation marks, adverbs, auxiliary verbs, determiners, and even a subordinating conjunction.
I still remember my middle school English teacher citing this verse from Richard II, Act 2, Scene 3, to illustrate the polymorphic nature of English words. In this line, the words grace and uncle are each used both as a noun (as expected) and as a verb.
Grace me no grace, nor uncle me no uncle
This is a distinction that spaCy finds without issue. This code:
doc = nlp("Grace me no grace, nor uncle me no uncle")
for t in doc:
    print(t, t.pos_)
Gives:
Grace VERB ... grace NOUN ... uncle VERB ... uncle NOUN
However, NLTK's universal tagger fails!
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
text = nltk.word_tokenize("Grace me no grace, nor uncle me no uncle")
nltk.pos_tag(text, tagset='universal')
> [('Grace', 'NOUN'), ..., ('grace', 'NOUN'), ..., ('uncle', 'ADP'), ..., ('uncle', 'NOUN')]
# ADP here is an Adposition (it's complicated)
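If a tag abbreviation leaves you puzzled, spaCy can decode it for you. The spacy.explain helper maps tag, POS, and entity-label abbreviations to human-readable descriptions, and it works without loading a model:

```python
import spacy

# spacy.explain looks up an abbreviation in spaCy's built-in glossary
for tag in ["SCONJ", "ADP", "DET", "AUX"]:
    print(f"{tag}: {spacy.explain(tag)}")
# SCONJ: subordinating conjunction
# ADP: adposition
# DET: determiner
# AUX: auxiliary
```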
Applications
Part-of-speech tagging can be applied to sentiment analysis, named entity recognition, and word-sense disambiguation.
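As a tiny illustration of the sentiment-analysis case, a pipeline might keep only the word classes that usually carry opinion, such as adjectives and adverbs. The (word, tag) pairs below are hypothetical tagger output; with spaCy they would come from (token.text, token.pos_):

```python
# Hypothetical (word, POS) pairs, in the form token.pos_ would produce
tagged = [("the", "DET"), ("movie", "NOUN"), ("was", "AUX"),
          ("surprisingly", "ADV"), ("good", "ADJ")]

# Keep only the word classes that usually carry sentiment
sentiment_bearing = [word for word, pos in tagged if pos in ("ADJ", "ADV")]
print(sentiment_bearing)  # ['surprisingly', 'good']
```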
Grammatical information is interesting but what if you could extract names, locations, nationalities or company names from a text?
That's where named-entity recognition (NER) comes in.
Named-Entity Recognition
Named-entity recognition is the task of identifying real-world objects (i.e., anything that can be denoted with a proper name) and classifying them into pre-defined categories.
The spaCy NER models can identify:
PERSON: people, real or fictional.
LOC: locations.
ORG: organizations such as companies, agencies, and institutions.
GPE: geopolitical entities such as countries, cities, and states.
They can also identify:
Date and time.
Percentages.
Money, events, works of art, and languages.
When you apply a model to a text (with doc = nlp(text)), spaCy runs the NER model previously loaded (with nlp = spacy.load(model)) to find all the entities in the text. They are made available in the iterable doc.ents.
Application
Let's use the NER model to find the most frequent characters from Alice in Wonderland. You would expect Alice, the Queen, and of course the Rabbit to be the most frequent, right?
You can load the text directly from Project Gutenberg with:
import requests
import spacy
from collections import Counter
nlp = spacy.load("en_core_web_sm")
r = requests.get('http://www.gutenberg.org/files/11/11-0.txt')
doc = nlp(r.text.split("*** END")[0])
# collect all the entities that are tagged PERSON
persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
# and list the 12 most common ones
Counter(persons).most_common(12)
This gives:
[('Alice', 311), ('Gryphon', 52), ('Queen', 43), ('Duchess', 27), ('Hatter', 17), ('the Mock Turtle', 16), ('Cat', 15), ('Mouse', 13), ('Dinah', 10), ('Bill', 8), ('Majesty', 8), ('Rabbit', 7)]
Although the Rabbit is a major character in the book, it only comes up seven times as a person. Maybe spaCy identified the Rabbit as some other type of entity? Let's find out.
This code:
rabbit_ner = [(ent.text, ent.label_) for ent in doc.ents if "Rabbit" in ent.text]
Counter(rabbit_ner).most_common(10)
Returns:
[(('the White Rabbit', 'ORG'), 11), (('Rabbit', 'PERSON'), 7), (('Rabbit', 'PRODUCT'), 3), (('White Rabbit', 'ORG'), 3), (('The Rabbit Sends', 'PRODUCT'), 1), (('Rabbit', 'EVENT'), 1), (('Rabbit', 'FAC'), 1), (('Rabbit', 'ORG'), 1), (('the White\r\nRabbit', 'ORG'), 1), (('The White Rabbit', 'ORG'), 1)]
Wow! Not what you expected, eh?
This example shows that even on a classic, clean, well-formatted text, spaCy struggles to correctly identify straightforward entities, mostly because of the nature, diversity, and volume of the data used to train the model in the first place.
The small en_core_web_sm model (11 MB) that we loaded to parse Alice in Wonderland was trained on OntoNotes, a corpus of 1,745k web-based English texts. That could be why! The larger en_core_web_lg model (782 MB) is trained on OntoNotes and a subset of the staggering Common Crawl data. You'd expect better NER results with that larger model!
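Short of switching models, a cheap fix is a post-processing pass that normalizes the entity strings before counting, so the Rabbit's many variants collapse into one character. This sketch reuses the (text, label) counts listed above; the normalize helper is just one plausible rule (drop a leading article, collapse line breaks):

```python
from collections import Counter

# (entity text, label) counts as returned for "Rabbit" above
rabbit_ner = [(('the White Rabbit', 'ORG'), 11), (('Rabbit', 'PERSON'), 7),
              (('Rabbit', 'PRODUCT'), 3), (('White Rabbit', 'ORG'), 3),
              (('The Rabbit Sends', 'PRODUCT'), 1), (('Rabbit', 'EVENT'), 1),
              (('Rabbit', 'FAC'), 1), (('Rabbit', 'ORG'), 1),
              (('the White\r\nRabbit', 'ORG'), 1), (('The White Rabbit', 'ORG'), 1)]

def normalize(text):
    # split() collapses the \r\n line break; then drop a leading "the"
    words = text.split()
    if words and words[0].lower() == "the":
        words = words[1:]
    return " ".join(words)

# Merge counts across labels and spelling variants
merged = Counter()
for (text, label), count in rabbit_ner:
    merged[normalize(text)] += count
print(merged.most_common())
# [('White Rabbit', 16), ('Rabbit', 13), ('Rabbit Sends', 1)]
```

With this merging, the White Rabbit climbs back toward the top of the character list, which matches intuition about the book far better than the raw output.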
Let's Recap!
Part-of-speech (POS) tagging is the task of finding the grammatical nature of the words in a sentence: nouns, verbs, adjectives, etc.
Named-entity recognition (NER) is the task of identifying persons, places and organizations in a text.
Use spaCy to do POS and NER right out of the box.
POS is a key component in NLP applications such as sentiment analysis, named-entity recognition, and word sense disambiguation.
As shown in Alice in Wonderland, NER is not as straightforward as POS and often requires extra post-processing to consolidate the entities it finds.
Okay, now it's really done. Good luck on your future NLP adventures!