The Meaning of Tokenization
In the previous chapter, we counted word frequencies by splitting the text over the whitespace character ' ' with the function text.split(' '). In other words, we tokenized words by splitting the text every time there was a space. Seems to work pretty well, right? Yes, until you dive into the details.
The Punctuation Problem
Consider the sentence "Let's eat, Grandpa." Splitting over whitespace results in:
Sentence | Crude space-based tokenization | Number of tokens |
Let's eat, Grandpa. | ["Let's", "eat,", "Grandpa."] | 3 |
Let's eat. | ["Let's", "eat."] | 2 |
As you can see, we end up with two different tokens for "eat," and "eat.".
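You can reproduce this with a couple of lines of Python; it is simply the crude whitespace split from the previous chapter applied to both sentences:
for sentence in ["Let's eat, Grandpa.", "Let's eat."]:
    print(sentence.split(' '))
> ["Let's", 'eat,', 'Grandpa.']
> ["Let's", 'eat.']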
A better tokenization procedure would separate the punctuation from the words and have specific tokens for the comma and the period. For the same sentences, the tokenization could produce:
Sentence | Smart tokenization | Number of tokens |
Let's eat, Grandpa. | ["Let", "'s", "eat", ",", "Grandpa", "."] | 6 |
Let's eat. | ["Let", "'s", "eat", "."] | 4 |
This way, the verb eat corresponds to the same unique token in both sentences.
As a general rule, good tokenization should handle:
Punctuation:
eat, => ["eat", ","]
Contractions:
can't => ["can", "'t"] or ["can", "'", "t"] ; doesn't => ["doesn", "'t"] or ["doesn", "'", "t"]
As you can see, you need a smarter way to tokenize than simply splitting text over spaces. Tokenization is a complex subject; however, all major NLP libraries offer reliable tokenizers! Let's take a look at one of them: NLTK.
Discover NLTK
The NLTK (Natural Language Toolkit) library is a Python library initially released in 2001 that covers text classification, tokenization, stemming, tagging, parsing, and many other tasks for semantic analysis. NLTK is available at http://www.nltk.org/.
In this course, we will only use NLTK for a few things: finding the right tokenizer, handling multiple words (n-grams), and completing a list of stop words.
NLTK offers several tokenizers, all part of the Tokenizer package, found in the NLTK documentation. Some are dedicated to a particular type of text—for instance, TweetTokenizer handles tweets, and WordPunctTokenizer handles punctuation. We will use the latter for this course. Let's see how it performs on a simple text.
from nltk.tokenize import WordPunctTokenizer
tokens = WordPunctTokenizer().tokenize("Let's eat your soup, Grandpa.")
print(tokens)
The punctuation marks and the pieces of the contraction now appear as separate tokens:
["Let", "'", "s", "eat", "your", "soup", ",", "Grandpa", "."]
If you apply WordPunctTokenizer to the original Earth text and list the most common tokens, you now get:
from collections import Counter
from nltk.tokenize import WordPunctTokenizer

text = wikipedia_page('Earth')
tokens = WordPunctTokenizer().tokenize(text)
print(Counter(tokens).most_common(20))
> [('the', 610), (',', 546), ('.', 478), ('of', 330), ('and', 237), ('Earth', 218), ('is', 176), ('to', 159), ('in', 132), ('a', 122), ('(', 115), ('s', 113), ("'", 112), ('The', 106), ('-', 81), ('from', 70), ('that', 63), ('by', 63), ('with', 52), ('as', 52)]
Tokenize on Characters or Syllables
Tokenization is not restricted to words or punctuation. In some cases, it can be more interesting to break the text down into syllables or even single characters. For example, the sentence "Earth is the third planet from the Sun." can be tokenized at several levels:
Words:
Earth; is; the; third; planet; from; the; Sun; .;
(Note that the ending period . is a token.)
Subwords:
Ear; th; is; the; thi; rd; pla; net; from; the; Sun; .;
Characters:
E;a;r;t;h; ;i;s; ;t;h;e; ;t;h;i;r;d; .....
(Note that the space character is a token.)
# example of character tokenization
char_tokens = [ c for c in text ]
print(Counter(char_tokens).most_common(10))
> [(' ', 8261), ('e', 5175), ('t', 3989), ('a', 3718), ('i', 3019), ('o', 2985), ('s', 2788), ('r', 2764), ('n', 2722), ('h', 2053)]
The type of tokens you use depends on the task. Character tokenization works best for spell checking. Word tokens are the most common, and subword tokenization is used in recent NLP models such as BERT.
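One way to see the trade-off between token types is to compare vocabulary sizes. Assuming the tokens and char_tokens variables from the snippets above are still defined, a quick check could look like this (the exact counts depend on the version of the page you downloaded):
# Compare the number of distinct word tokens vs. distinct characters
print('word vocabulary size:', len(set(tokens)))
print('character vocabulary size:', len(set(char_tokens)))
The character vocabulary stays small no matter how long the text is, while the word vocabulary keeps growing with the text.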
Tokenize on N-Grams
Some words are better understood together. For instance, deep learning, New York, love at first sight, or The New England Journal of Medicine. Therefore, when tokenizing a text, it can be useful to consider groups of two words (bigrams) or three words (trigrams), etc. In general, groups of words taken as a single token are called n-grams.
You can generate the n-grams of a text with NLTK's ngrams() function, as follows:
from nltk import ngrams
from nltk.tokenize import WordPunctTokenizer
text = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
# Tokenize
tokens = WordPunctTokenizer().tokenize(text)
# bigrams
bigrams = [w for w in ngrams(tokens,n=2)]
print(bigrams)
[('How', 'much'), ('much', 'wood'), ('wood', 'would'), ('would', 'a'), ('a', 'woodchuck'), ('woodchuck', 'chuck'), ('chuck', 'if'), ('if', 'a'), ('a', 'woodchuck'), ('woodchuck', 'could'), ('could', 'chuck'), ('chuck', 'wood'), ('wood', '?')]
# trigrams
trigrams = [w for w in ngrams(tokens, n=3)]
print(trigrams)
[('How', 'much', 'wood'), ('much', 'wood', 'would'), ('wood', 'would', 'a'), ('would', 'a', 'woodchuck'), ('a', 'woodchuck', 'chuck'), ('woodchuck', 'chuck', 'if'), ('chuck', 'if', 'a'), ('if', 'a', 'woodchuck'), ('a', 'woodchuck', 'could'), ('woodchuck', 'could', 'chuck'), ('could', 'chuck', 'wood'), ('chuck', 'wood', '?')]
You can create new multiword tokens by joining the n-grams with "_":
bi_tokens = ['_'.join(w) for w in bigrams]
print(bi_tokens)
['How_much', 'much_wood', 'wood_would', 'would_a', 'a_woodchuck', 'woodchuck_chuck', 'chuck_if', 'if_a', 'a_woodchuck', 'woodchuck_could', 'could_chuck', 'chuck_wood', 'wood_?']
Your Turn: Get Some Practice!
Stop words, tokenizers, and word clouds may seem simple to implement, but the devil is in the details when working on a text. It's important to get some practice with these powerful tools.
Here are some steps to follow:
Find a Wikipedia page, a text from Project Gutenberg, or any other NLP dataset.
Tokenize the text using NLTK's WordPunctTokenizer.
Explore the list of tokens and their frequency.
Experiment with the WordCloud() parameters to generate different word clouds from the original text:
collocations = False
normalize_plurals = True or False
include_numbers = True or False
min_word_length
stopwords
Remove stop words from the original text.
Use string.punctuation and string.digits to remove punctuation and numbers (see the sketch below).
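As a starting point, here is a minimal sketch of those cleaning steps. It assumes the wordcloud package and the NLTK stop word list are installed; the sample text, the parameter values, and the output file name are just illustrative:
import string
from collections import Counter

from nltk.corpus import stopwords            # requires nltk.download('stopwords')
from nltk.tokenize import WordPunctTokenizer
from wordcloud import WordCloud

text = "Earth is the third planet from the Sun and the only place known to harbor life."

# Tokenize and lowercase
tokens = [t.lower() for t in WordPunctTokenizer().tokenize(text)]

# Drop stop words and tokens made up only of punctuation or digits
stop_words = set(stopwords.words('english'))
unwanted = set(string.punctuation) | set(string.digits)
clean_tokens = [t for t in tokens
                if t not in stop_words
                and not all(c in unwanted for c in t)]
print(Counter(clean_tokens).most_common(5))

# Generate a word cloud from the cleaned tokens
wc = WordCloud(collocations=False, normalize_plurals=True,
               include_numbers=False, min_word_length=2,
               stopwords=stop_words)
wc.generate(' '.join(clean_tokens))
wc.to_file('wordcloud.png')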
You can find a solution in this Jupyter Notebook.
Let's Recap!
Splitting on whitespace does not take punctuation or contractions into account and does not scale: near-duplicate tokens such as "eat," and "eat." inflate the vocabulary.
NLTK offers several tokenizers, all part of the Tokenizer package, found in the NLTK documentation. Some are dedicated to a particular type of text, so choose the one that fits your text best.
Tokenization is not limited to words. For certain use cases and recent models, character- or syllable-based tokens are more efficient.
N-grams are groups of words taken as a single token. You can generate them with NLTK's ngrams() function.
Always keep in mind that the size of the vocabulary directly impacts the necessary computing power.
The text cleaning job is far from over, as words usually take multiple forms. Think about plurals, conjugations, or even declensions (home, house). For instance, the words is, are, or am all boil down to the verb to be.
In the next chapter, we will look at stemming and lemmatization: two common techniques that transform any word into a unique root form.