
Updated on 15/12/2022

Apply Tokenization Techniques

The Meaning of Tokenization

The process of transforming a text into a list of its words is called tokenization: it chops up the text into pieces called tokens.

In the previous chapter, we counted word frequencies by splitting the text over the whitespace character ' ' with text.split(' '). In other words, we tokenized words by splitting the text every time there was a space. Seems to work pretty well, right? Yes, until you dive into the details.

The Punctuation Problem

Consider the two sentences:  “Let’s eat, Grandpa.”  and  “Let’s eat.”  Splitting over whitespaces results in:

  • "Let's eat, Grandpa." → Let's | eat, | Grandpa. (3 tokens)

  • "Let's eat." → Let's | eat. (2 tokens)

As you can see, we end up with two different tokens for the same verb: "eat," and "eat.".
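You can reproduce this crude tokenization directly in Python with the built-in str.split(), using only the two sentences above:

# Crude whitespace tokenization with the built-in str.split()
for sentence in ["Let's eat, Grandpa.", "Let's eat."]:
    print(sentence.split(' '))

Returns:

> ["Let's", 'eat,', 'Grandpa.']
> ["Let's", 'eat.']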

A better tokenization procedure would separate the punctuation from the words and have specific tokens for the comma and the period. For the same sentences, the tokenization could produce:

  • "Let's eat, Grandpa." → Let | 's | eat | , | Grandpa | . (6 tokens)

  • "Let's eat." → Let | 's | eat | . (4 tokens)

This way, the verb eat corresponds to the same unique token in both sentences.

As a general rule, good tokenization should handle the following:

  • Punctuation:  eat, => ["eat", ","]

  • Contractions:  can't => ["can", "'t"] or ["can", "'", "t"] ; doesn't => ["doesn", "'t"] or ["doesn", "'", "t"]

As you can see, you need a smarter way to tokenize than simply splitting text over spaces. Tokenization is a complex subject; however, all major NLP libraries offer reliable tokenizers! Let’s take a look at one of them: NLTK. 

Discover NLTK

NLTK (Natural Language Toolkit) is a Python library, initially released in 2001, that covers text classification, tokenization, stemming, tagging, parsing, and many other tasks for semantic analysis. NLTK is available at http://www.nltk.org/.

In this course, we will only use NLTK for a few things: finding the right tokenizer, handling multiple words (n-grams), and completing a list of stopwords.

NLTK offers several tokenizers, all part of the Tokenizer package found in the NLTK documentation. Some are dedicated to a particular type of text—for instance, TweetTokenizer handles tweets, and  WordPunctTokenizer handles punctuation. We will use the latter for this course. Let’s see how it performs on a simple text.

# Import the tokenizer
from nltk.tokenize import WordPunctTokenizer

# Tokenize the sentence
tokens = WordPunctTokenizer().tokenize("Let's eat your soup, Grandpa.")
print(tokens)

It returns the words of the sentence along with separate tokens for the punctuation and the contraction:

["Let", "'", "s", "eat", "your", "soup", ",", "Grandpa", "."]

If you apply WordPunctTokenizer to the original Earth text and list the most common tokens, you now get:

# Import the tokenizer and the Counter class
from collections import Counter
from nltk.tokenize import WordPunctTokenizer

# Get the text from the Earth Wikipedia page
# (wikipedia_page() is the helper function defined in the previous chapter)
text = wikipedia_page('Earth')

# Tokenize
tokens = WordPunctTokenizer().tokenize(text)

# Print the 20 most common tokens
print(Counter(tokens).most_common(20))

Returns:

> [('the', 610), (',', 546), ('.', 478), ('of', 330), ('and', 237), ('Earth', 218), ('is', 176), ('to', 159), ('in', 132), ('a', 122), ('(', 115), ('s', 113), ("'", 112), ('The', 106), ('-', 81), ('from', 70), ('that', 63), ('by', 63), ('with', 52), ('as', 52)]

Tokenize on Characters or Syllables

Tokenization is not restricted to words or punctuation. In some cases, breaking down the text into a list of its syllables can be more interesting. For example, the sentence “Earth is the third planet from the Sun.” can be tokenized with:

  • Words:  Earth; is; the; third; planet; from; the; Sun; .;
    (Note that the ending period "." is a token.)

  • Subwords:   Ear; th; is; the; thi; rd; pla; net; from; the; Sun; .;

  • Characters:   E;a;r;t;h; ;i;s; ;t;h;e; ;t;h;i;r;d; .....
    (Note that the space character is a token.)

# Example of character tokenization
char_tokens = [c for c in text]

# Print the 10 most common characters
print(Counter(char_tokens).most_common(10))

Returns:

> [(' ', 8261), ('e', 5175), ('t', 3989), ('a', 3718), ('i', 3019), ('o', 2985), ('s', 2788), ('r', 2764), ('n', 2722), ('h', 2053)]

The type of tokens you use depends on the task. For example, character tokenization works best for spell-checking. On the other hand, word tokens are the most common, and subword tokenization is used in recent NLP models such as BERT.
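To get a feel for subword tokenization, here is a minimal sketch using the Hugging Face transformers library (not used elsewhere in this course, and assumed to be installed) to load the WordPiece tokenizer behind BERT:

# Minimal sketch of subword (WordPiece) tokenization,
# assuming the Hugging Face transformers library is installed
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Words missing from the model's vocabulary are split into smaller pieces;
# continuation pieces are prefixed with "##" (the exact split depends on the vocabulary)
print(bert_tokenizer.tokenize("How much wood would a woodchuck chuck?"))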

Tokenize on N-Grams

Some words are better understood together. For instance, deep learning, New York, thank you, or red hot chili peppers (the band and the spice). Therefore, when tokenizing a text, it can be helpful to consider groups of two words (bigrams) or three words (trigrams), etc. In general, groups of words taken as a single token are called n-grams.

Given a text, you can generate its n-grams with NLTK's ngrams() function as such:

from nltk import ngrams
from nltk.tokenize import WordPunctTokenizer

text = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"

# Tokenize
tokens = WordPunctTokenizer().tokenize(text)

# Build the bigrams
bigrams = list(ngrams(tokens, n=2))

print(bigrams)

Returns:

[('How', 'much'), ('much', 'wood'), ('wood', 'would'), ('would', 'a'), ('a', 'woodchuck'), ('woodchuck', 'chuck'), ('chuck', 'if'), ('if', 'a'), ('a', 'woodchuck'), ('woodchuck', 'could'), ('could', 'chuck'), ('chuck', 'wood'), ('wood', '?')]

And for trigrams, you get:

# Trigrams
trigrams = list(ngrams(tokens, n=3))
print(trigrams)

Returns:

[('How', 'much', 'wood'), ('much', 'wood', 'would'), ('wood', 'would', 'a'), ('would', 'a', 'woodchuck'), ('a', 'woodchuck', 'chuck'), ('woodchuck', 'chuck', 'if'), ('chuck', 'if', 'a'), ('if', 'a', 'woodchuck'), ('a', 'woodchuck', 'could'), ('woodchuck', 'could', 'chuck'), ('could', 'chuck', 'wood'), ('chuck', 'wood', '?')]

You can create new 2-word tokens by joining the bigrams with "_":

bi_tokens = ['_'.join(w) for w in bigrams]
print(bi_tokens)

Returns:

['How_much', 'much_wood', 'wood_would', 'would_a', 'a_woodchuck', 'woodchuck_chuck', 'chuck_if', 'if_a', 'a_woodchuck', 'woodchuck_could', 'could_chuck', 'chuck_wood', 'wood_?']
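Putting the pieces together, you can count the most frequent bigrams of a longer text exactly as you counted single tokens. Here is a minimal sketch, assuming the wikipedia_page() helper from the previous chapter is available:

# Count the most frequent bigrams of a longer text
from collections import Counter

from nltk import ngrams
from nltk.tokenize import WordPunctTokenizer

# wikipedia_page() is the helper function from the previous chapter
earth_text = wikipedia_page('Earth')

earth_tokens = WordPunctTokenizer().tokenize(earth_text)
earth_bigrams = ['_'.join(w) for w in ngrams(earth_tokens, n=2)]

print(Counter(earth_bigrams).most_common(10))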

Your Turn: Get Some Practice!

Stopwords, tokenizers, and word clouds may seem simple to implement, but the devil is in the details when working on a text. It’s important to get some practice with these powerful tools.

Here are some steps to follow:

  • Find a Wikipedia page, a text from Project Gutenberg, or any other NLP dataset.

  • Tokenize the text using NLTK  WordPunctTokenizer  .

  • Explore the list of tokens and their frequency.

  • Experiment with the  WordCloud()  parameters to generate different word clouds from the original text:

    • collocations = False

    • normalize_plurals = True or False

    • include_numbers = True or False

    • min_word_length

    • stopwords

  • Remove stopwords from the original text.

  • Use  string.punctuation  and  string.digits  to remove punctuation and numbers (a minimal sketch of these last two steps follows this list).
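Here is one possible sketch for those last two steps. It assumes the NLTK stopword list has been downloaded with nltk.download('stopwords') and uses a placeholder string in place of your own text:

# Remove stopwords, punctuation, and numbers from a tokenized text
import string
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer

nltk.download('stopwords')  # only needed once

text = "Replace this placeholder with your own Wikipedia or Gutenberg text."

# Lowercase and tokenize
tokens = WordPunctTokenizer().tokenize(text.lower())

# English stopwords from NLTK
stops = set(stopwords.words('english'))

# Drop stopwords and tokens made only of punctuation and/or digits
clean_tokens = [t for t in tokens
                if t not in stops
                and not all(c in string.punctuation + string.digits for c in t)]

print(Counter(clean_tokens).most_common(20))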

You can find a solution in this Jupyter Notebook.

Let’s Recap!

  • Splitting on whitespace does not handle punctuation or contractions and does not scale well.

  • NLTK offers several tokenizers, all part of the NLTK Tokenizer Package you can find in the documentation. Some are dedicated to a particular type of text, so choose the one that fits best.  

  • Tokenization is not limited to words. For certain use cases and recent models, character or syllable-based tokens are more efficient.

  • N-grams are groups of words taken as a single token. You can generate them with  nltk.ngrams()  .

  • The larger the vocabulary, the more computing power you need.

The text-cleaning job is far from over, as words usually take multiple forms. Think about plurals, conjugations, or even declensions (home, house). For instance, is, are, and am all boil down to the verb "to be."

In the next chapter, we will look at stemming and lemmatization: two common techniques that transform any word into a unique root form.
