  • 10 hours
  • Difficult

This course is available free of charge online.


Updated on 15/12/2022

Do More With spaCy

If you apply a spaCy model to a text, you get the tokenization, lemmatization, and morphological flags (space, uppercase, digit, etc.) for each token.

During that operation, spaCy also performs two essential tasks:

  • Part-of-speech tagging (POS): finding the grammatical nature of a word in the sentence. Grammatical nature includes nouns, adjectives, prepositions, verbs, etc. 

  • Named entity recognition (NER): the ability to automatically identify and extract key items or entities in a text (names of persons, companies, products, or locations but also days, medical terms, or quantities).

POS and NER are essential tasks in NLP. POS is used for information extraction (finding all the adjectives associated with a person or a product, for example) and facilitates language understanding for complex NLP tasks (text generation, for instance). NER is used across many domains to identify specific entities from the text (medical terms, legal concepts, people, etc.).

Identify the Nature of a Word With Part-of-Speech Tagging

Part-of-speech tagging (POS) identifies the grammatical nature of each word in a sentence. The example in the previous chapter illustrates this: in the sentence “I met him during the meeting,” two forms of the word meet appear, and met is tagged as a verb while meeting is tagged as a noun.

How can we use POS?

For instance, extracting the adjectives related to a particular entity (a person) in online social content would give you a good understanding of people’s opinions of that person.

POS is also used for word-sense disambiguation.

POS tagging with spaCy works just like lemmatization: apply a spaCy model to a text with doc = nlp(text), loop over each token, and print out the token attribute pos_.

Let’s try an example with that quote from Alice in Wonderland:

If you don’t know where you are going, any road can take you there.

Alice in Wonderland - Cheshire Cat

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("If you don't know where you are going any road can take you there.")

# print the nature of each token
for token in doc:
    print(f"{token.text}\t{token.pos_}")

This results in:

Token   POS
If      SCONJ
you     PRON
do      AUX
n't     PART
know    VERB
where   SCONJ
you     PRON
are     AUX
going   VERB
any     DET
road    NOUN
can     AUX
take    VERB
you     PRON
there   ADV
.       PUNCT

There are verbs, pronouns, punctuation marks, adverbs, auxiliaries, a determiner, subordinating conjunctions, and a particle (the n't of don't).

I remember my middle school English teacher quoting this line from Shakespeare’s Richard II, Act 2, Scene 3, to illustrate the polymorphic nature of English words. In this line, grace and uncle are each used both as nouns (as expected) and as verbs.

Grace me no grace, nor uncle me no uncle

spaCy doesn’t have an issue finding this distinction. This code:

doc = nlp("Grace me no grace, nor uncle me no uncle")
for t in doc:
    print(t, t.pos_)

Results in:

Grace VERB ... grace NOUN ... uncle VERB ... uncle NOUN

On the other hand, the NLTK universal tagger fails to correctly parse the sentence. spaCy 1, NLTK 0!

import nltk

# download the tokenizer, tagger, and tagset resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

text = nltk.word_tokenize("Grace me no grace, nor uncle me no uncle")
nltk.pos_tag(text, tagset='universal')

Returns:

> [('Grace', 'NOUN'), ..., ('grace', 'NOUN'), ..., ('uncle', 'ADP'), ..., ('uncle', 'NOUN')]
# ADP here is an Adposition (it's complicated)

Grammar is great, but what if you could extract names, locations, nationalities, or company names from a text?

That’s where named entity recognition comes in.

Extract Real-World Objects With Named Entity Recognition

Named entity recognition (NER) identifies real-world objects (i.e., anything that can be denoted with a proper name) and classifies them into predefined categories.

Out of the box, spaCy can identify the following:

  • PERSON: people (existing or fictional).

  • LOC: non-GPE locations, such as mountain ranges and bodies of water.

  • ORG: organizations such as companies, agencies, institutions, etc.

  • GPE: countries, cities, states.

It can also identify:

  • Date and time.

  • Percentages.

  • Money, events, works of art, and languages.

When applying a model to a text (with doc = nlp(text)), spaCy applies the NER model previously loaded (with nlp = spacy.load(model)) to find all the entities in the text. The information is made available in the iterable doc.ents.

For example, let’s use the NER model to find the most frequent characters from Alice in Wonderland. You would expect Alice, the Queen, and of course the Rabbit to be the most frequent, right?

You can load the text directly from Project Gutenberg with:

import requests
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

# text from Alice in Wonderland
r = requests.get('http://www.gutenberg.org/files/11/11-0.txt')

# remove the footer and parse the text
doc = nlp(r.text.split("*** END")[0])

# Find all the 'persons' in the text
persons = []
# For each entity in the doc 
for ent in doc.ents:
    # if the entity is a person
    if ent.label_ == 'PERSON':
        # add to the list of persons
        persons.append(ent.text)

# note we could have written the last bit in one line with
persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']

# list the 12 most common ones
Counter(persons).most_common(12)

This gives:

[('Alice', 311),  ('Gryphon', 52),  ('Queen', 43),  ('Duchess', 27),  ('Hatter', 17),  ('the Mock Turtle', 16),  ('Cat', 15),  ('Mouse', 13),  ('Dinah', 10),  ('Bill', 8),  ('Majesty', 8),  ('Rabbit', 7)]

Although the Rabbit is a major character in the book, he only comes up seven times as a person. Maybe spaCy identified the Rabbit as some other type of entity? Let’s find out.

This code:

rabbit_ner = [(ent.text, ent.label_) for ent in doc.ents if "Rabbit" in ent.text]
Counter(rabbit_ner).most_common(10)

Returns:

[(('the White Rabbit', 'ORG'), 11),  (('Rabbit', 'PERSON'), 7),  (('Rabbit', 'PRODUCT'), 3),  (('White Rabbit', 'ORG'), 3),  (('The Rabbit Sends', 'PRODUCT'), 1),  (('Rabbit', 'EVENT'), 1),  (('Rabbit', 'FAC'), 1),  (('Rabbit', 'ORG'), 1),  (('the White\r\nRabbit', 'ORG'), 1),  (('The White Rabbit', 'ORG'), 1)]

Wow! Not what you expected, eh?

This example shows that even on a classic, well-formatted, and clean text, spaCy struggles to identify straightforward entities correctly. This is mostly a consequence of the model we used for the NER task.

The small  en_core_web_sm  model (11 MB) we loaded to parse Alice in Wonderland is trained on OntoNotes, a corpus of 1,745k web-based English texts. In contrast, the larger  en_core_web_lg  model (782 MB) is trained on OntoNotes and a subset of the staggering Common Crawl data. You’d expect better NER results using that larger model!
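Short of switching to a larger model, a pragmatic workaround is to normalize entity strings before counting, so that variants such as “the White Rabbit” and a line-wrapped “White Rabbit” collapse into one key. A standard-library-only sketch (the mentions list is hypothetical sample data mirroring the output above):

```python
from collections import Counter

def normalize(entity: str) -> str:
    # join line-wrapped mentions and drop a leading article
    entity = " ".join(entity.split())
    for article in ("the ", "The "):
        if entity.startswith(article):
            entity = entity[len(article):]
    return entity

# hypothetical mentions mirroring the NER output above
mentions = ["the White Rabbit", "White Rabbit", "the White\r\nRabbit",
            "The White Rabbit", "Rabbit", "Rabbit"]
print(Counter(normalize(m) for m in mentions).most_common())
# -> [('White Rabbit', 4), ('Rabbit', 2)]
```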

Let’s Recap!

  • Part-of-speech (POS) tagging is the task of finding the grammatical nature of the words in a sentence: nouns, verbs, adjectives, etc.

  • Named entity recognition (NER) identifies persons, places, and organizations in a text.

  • Use spaCy to do POS and NER right out of the box.

  • POS is a key component in NLP applications such as word sense disambiguation.

  • As you saw in Alice in Wonderland, NER is not as straightforward as POS and requires extra preprocessing to identify entities.

This concludes Part I of the course. So far, you’ve sliced and diced text into tokens to simplify it. In the next part, you will transform text into numbers to make it more computer friendly. This process is called vectorization.
