The Meaning of Tokenization
The process of transforming a text into a list of its words is called tokenization: it chops the text up into pieces called tokens.
In the previous chapter, we counted word frequencies by splitting the text over the whitespace character ' ' with the function text.split(' '). In other words, we tokenized words by splitting the text every time there was a space. Seems to work pretty well, right? Yes, until you dive into the details.
The Punctuation Problem
Consider the two sentences: “Let’s eat, Grandpa.” and “Let’s eat.” Splitting over whitespace results in:
Sentence | Crude space-based tokenization | Number of tokens
Let’s eat, Grandpa. | ["Let's", "eat,", "Grandpa."] | 3
Let’s eat. | ["Let's", "eat."] | 2
As you can see, we end up with two different tokens for “eat,” and “eat.”
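You can confirm this with a quick check in plain Python:
# Naive whitespace tokenization of the two example sentences
for sentence in ["Let's eat, Grandpa.", "Let's eat."]:
    print(sentence.split(' '))
# ["Let's", 'eat,', 'Grandpa.']
# ["Let's", 'eat.']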
A better tokenization procedure would separate the punctuation from the words and have specific tokens for the comma and the period. For the same sentence, the tokenization could produce:
Sentence | Smart tokenization | Number of tokens
Let’s eat, Grandpa. | ["Let", "'s", "eat", ",", "Grandpa", "."] | 6
Let’s eat. | ["Let", "'s", "eat", "."] | 4
This way, the verb eat corresponds to the same unique token in both sentences.
As a general rule, good tokenization should handle the following:
Punctuation:
eat, => ["eat", ","]
Contractions:
can't => ["can", "'t"] or ["can", "'", "t"] ; doesn't => ["doesn", "'t"] or ["doesn", "'", "t"]
As you can see, you need a smarter way to tokenize than simply splitting text over spaces. Tokenization is a complex subject; however, all major NLP libraries offer reliable tokenizers! Let’s take a look at one of them: NLTK.
Discover NLTK
The NLTK (Natural Language Toolkit) library is a Python library initially released in 2001 that covers text classification, tokenization, stemming, tagging, parsing, and many other tasks for semantic analysis. NLTK is available at http://www.nltk.org/.
In this course, we will only use NLTK for a few things: finding the right tokenizer, handling multiple words (n-grams), and completing a list of stopwords.
NLTK offers several tokenizers, all part of the Tokenizer package found in the NLTK documentation. Some are dedicated to a particular type of text—for instance, TweetTokenizer handles tweets, and WordPunctTokenizer handles punctuation. We will use the latter for this course. Let’s see how it performs on a simple text.
# Import the tokenizer
from nltk.tokenize import WordPunctTokenizer
# Tokenize the sentence
tokens = WordPunctTokenizer().tokenize("Let's eat your soup, Grandpa.")
It returns separate tokens for the words, the punctuation, and the contraction:
["Let", "'", "s", "eat", "your", "soup", ",", "Grandpa", "."]
If you apply this tokenizer to the original Earth text and list the most common tokens, you now get:
# Import the tokenizer and the counter
from collections import Counter
from nltk.tokenize import WordPunctTokenizer
# Get the text from the Earth wikipedia page
text = wikipedia_page('Earth')
# Tokenize
tokens = WordPunctTokenizer().tokenize(text)
# Print the 20 most common tokens
print(Counter(tokens).most_common(20))
Returns:
> [('the', 610), (',', 546), ('.', 478), ('of', 330), ('and', 237), ('Earth', 218), ('is', 176), ('to', 159), ('in', 132), ('a', 122), ('(', 115), ('s', 113), ("'", 112), ('The', 106), ('-', 81), ('from', 70), ('that', 63), ('by', 63), ('with', 52), ('as', 52)]
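As an aside (a minimal sketch, not part of the original pipeline), you could already drop the tokens made only of punctuation with string.punctuation, reusing the tokens list from the block above; the practice section below comes back to this idea:
# Keep only tokens that contain at least one non-punctuation character
import string
from collections import Counter
word_tokens = [t for t in tokens if not all(c in string.punctuation for c in t)]
print(Counter(word_tokens).most_common(10))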
Tokenize on Characters or Syllables
Tokenization is not restricted to words or punctuation. In some cases, breaking the text down into a list of its syllables or characters can be more useful. For example, the sentence “Earth is the third planet from the Sun.” can be tokenized into:
Words:
Earth; is; the; third; planet; from; the; Sun; .;
(Note that the ending period is a token.)
Subwords:
Ear; th; is; the; thi; rd; pla; net; from; the; Sun; .;
Characters:
E;a;r;t;h; ;i;s; ;t;h;e; ;t;h;i;r;d; .....
(Note that the space character is a token.)
# Example of character tokenization
char_tokens = [c for c in text]
# Print the 10 most common characters
print(Counter(char_tokens).most_common(10))
Returns:
> [(' ', 8261), ('e', 5175), ('t', 3989), ('a', 3718), ('i', 3019), ('o', 2985), ('s', 2788), ('r', 2764), ('n', 2722), ('h', 2053)]
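NLTK also provides a SyllableTokenizer based on the sonority sequencing principle. Assuming your NLTK version ships it (3.4 or later), a rough sketch of syllable tokenization could look like this:
# Assumption: SyllableTokenizer is available in your NLTK version
from nltk.tokenize import SyllableTokenizer, WordPunctTokenizer
ssp = SyllableTokenizer()
words = WordPunctTokenizer().tokenize("Earth is the third planet from the Sun.")
# Syllable-tokenize alphabetic words only; keep punctuation tokens as they are
syllables = []
for word in words:
    if word.isalpha():
        syllables.extend(ssp.tokenize(word))
    else:
        syllables.append(word)
print(syllables)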
The type of tokens you use depends on the task. For example, character tokenization works best for spell-checking. On the other hand, word tokens are the most common, and subword tokenization is used in recent NLP models such as BERT.
Tokenize on N-Grams
Some words are better understood together. For instance, deep learning, New York, thank you, or red hot chili peppers (the band and the spice). Therefore, when tokenizing a text, it can be helpful to consider groups of two words (bigrams) or three words (trigrams), etc. In general, groups of words taken as a single token are called n-grams.
Given a text, you can generate its n-grams with the NLTK ngrams() function, as such:
from nltk import ngrams
from nltk.tokenize import WordPunctTokenizer
text = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
# Tokenize
tokens = WordPunctTokenizer().tokenize(text)
# Only keep the bigrams
bigrams = [w for w in ngrams(tokens,n=2)]
print(bigrams)
Returns:
[('How', 'much'), ('much', 'wood'), ('wood', 'would'), ('would', 'a'), ('a', 'woodchuck'), ('woodchuck', 'chuck'), ('chuck', 'if'), ('if', 'a'), ('a', 'woodchuck'), ('woodchuck', 'could'), ('could', 'chuck'), ('chuck', 'wood'), ('wood', '?')]
And for trigrams, you get:
# Trigrams
trigrams = [w for w in ngrams(tokens, n=3)]
print(trigrams)
[('How', 'much', 'wood'), ('much', 'wood', 'would'), ('wood', 'would', 'a'), ('would', 'a', 'woodchuck'), ('a', 'woodchuck', 'chuck'), ('woodchuck', 'chuck', 'if'), ('chuck', 'if', 'a'), ('if', 'a', 'woodchuck'), ('a', 'woodchuck', 'could'), ('woodchuck', 'could', 'chuck'), ('could', 'chuck', 'wood'), ('chuck', 'wood', '?')]
You can create new two-word tokens by joining each bigram’s words over “_”:
bi_tokens = ['_'.join(w) for w in bigrams]
print(bi_tokens)
['How_much', 'much_wood', 'wood_would', 'would_a', 'a_woodchuck', 'woodchuck_chuck', 'chuck_if', 'if_a', 'a_woodchuck', 'woodchuck_could', 'could_chuck', 'chuck_wood', 'wood_?']
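These n-gram tokens can be counted like any other token. For instance, going back to the Earth text (still relying on the wikipedia_page() helper used earlier), a quick sketch to list the most frequent bigrams:
# Tokenize the Earth page and count its most frequent bigrams
from collections import Counter
from nltk import ngrams
from nltk.tokenize import WordPunctTokenizer
earth_tokens = WordPunctTokenizer().tokenize(wikipedia_page('Earth'))
earth_bigrams = ['_'.join(w) for w in ngrams(earth_tokens, n=2)]
print(Counter(earth_bigrams).most_common(10))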
Your Turn: Get Some Practice!
Stopwords, tokenizers, and word clouds may seem simple to implement, but the devil is in the details when working on a text. It’s important to get some practice with these powerful tools.
Here are some steps to follow:
Find a Wikipedia page, a text from Project Gutenberg, or any other NLP dataset.
Tokenize the text using the NLTK WordPunctTokenizer.
Explore the list of tokens and their frequency.
Experiment with the WordCloud() parameters to generate different word clouds from the original text:
collocations = False
normalize_plurals = True or False
include_numbers = True or False
min_word_length
stopwords
Remove stopwords from the original text.
Use string.punctuation and string.digits to remove punctuation and numbers.
You can find a solution in this Jupyter Notebook.
Let’s Recap!
Splitting on whitespace does not handle punctuation or contractions and does not scale to real texts.
NLTK offers several tokenizers, all part of the NLTK Tokenizer Package you can find in the documentation. Some are dedicated to a particular type of text, so choose the one that fits best.
Tokenization is not limited to words. For certain use cases and recent models, character or syllable-based tokens are more efficient.
N-grams are groups of words taken as a single token. You can generate them with the NLTK ngrams() function.
The larger the vocabulary, the more computing power you need.
The text-cleaning job is far from over, as words usually take multiple forms. Think about plurals, conjugations, or even declensions (home, house). For instance, is, are, and am all boil down to the verb “to be.”
In the next chapter, we will look at stemming and lemmatization: two common techniques that transform any word into a unique root form.