• 10 hours
  • Hard


Last updated on 3/4/22

Remove Stop Words From a Block of Text


This chapter will guide you through removing stop words, and by the end, you will have created a much clearer version of the previous chapter's word cloud!

Count Word Frequencies

First, let's download the content of the Wikipedia Earth page in a friendly text format using the wikipedia_page(title) function defined in the previous chapter.

import requests
text = wikipedia_page('Earth').lower()
print(text[:200] + '...')

Now that you have downloaded the text, the next step is to count each word's frequency. This helps identify words that are not specific to the subject but appear very frequently (for example, "and" or "the").

Remember the definition of stop words?

Stop words are words that do not provide any useful information for inferring a text's content or nature. This may be either because they don't carry any meaning (prepositions, conjunctions, etc.) or because they are too frequent.

Counting word frequencies is a three-step process:

  • The first step is to create a list of all the words by splitting the text over the whitespace character ' ' with text.split(' ').

  • Then, count how many times each word appears in the text using the  Counter  class from the collections module.

from collections import Counter
# we transform the text into a list of words
# by splitting over the space character ' '
word_list = text.split(' ')
# and count the words
word_counts = Counter(word_list)

The result,  word_counts , is a dictionary-like Counter object whose keys are the distinct words in the text and whose values are the number of times each word appears.

  • Finally, list the 20 most common words with  word_counts.most_common(20) .

for w in word_counts.most_common(20):
    print(f"{w[0]}: \t{w[1]} ")
the: 674
of: 330
and: 230
is: 173
to: 157
in: 145
a: 124
earth: 78
from: 75
earth's: 75
by: 69
that: 63
as: 57
at: 56
with: 52
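
The three steps above can be tried on any string, without downloading a page. The following is a minimal sketch that assumes a short made-up sentence (sample_text is illustrative and is not the Wikipedia content); the variable names are also assumptions made for this example.

```python
from collections import Counter

# Hypothetical sample text standing in for the downloaded page content
sample_text = "The Earth is the third planet from the Sun and the Earth is home to life"

# Step 1: lowercase the text and split it over the space character ' '
sample_words = sample_text.lower().split(' ')

# Step 2: count how many times each word appears
sample_counts = Counter(sample_words)

# Step 3: print the most common words, most frequent first
for word, count in sample_counts.most_common(3):
    print(f"{word}: \t{count}")
```

Even in a single sentence, the most frequent word is a stop word ("the"), which is the same pattern seen in the counts for the full page.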

Remove Stop Words

It's time to get rid of all the meaningless words from that list.

# transform the text into a list of words
words_list = text.split(' ')
# define the list of words you want to remove from the text
stopwords = ['the', 'of', 'and', 'is','to','in','a','from','by','that', 'with', 'this', 'as', 'an', 'are','its', 'at', 'for']
# use a python list comprehension to remove the stopwords from words_list
words_without_stopwords = [ word for word in words_list if word not in stopwords ]

The list of the top 20 most frequent words is now very different:

> [('earth', 78), ("earth's", 75), ('on', 50), ('about', 45), ('solar', 36), ('million', 36), ('surface', 34), ('life', 30), ('it', 30), ('sun', 26), ('other', 26), ('has', 26), ('was', 26), ('have', 25), ('or', 25), ('than', 23), ('which', 22), ('be', 22), ('over', 21), ('into', 21)]

The top 20 now contains multiple meaningful words such as million, surface, life, solar, and sun. Although there are still some stop words we could get rid of (on, it, has, etc.), you can see that removing only a handful improves the value of the word cloud.
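
The new ranking comes from counting the filtered list again. Here is a small sketch of that filter-then-recount step, assuming a made-up sentence and a shortened stop word list (sample_text and sample_stopwords are illustrative, not the values used for the Earth page).

```python
from collections import Counter

# Hypothetical sample text; in the chapter this would be the downloaded page
sample_text = "the earth is the third planet from the sun and the earth is home to life"

# Short illustrative stop word list (a subset of the list defined above)
sample_stopwords = ['the', 'of', 'and', 'is', 'to', 'in', 'a', 'from']

# Keep only the words that are not stop words
filtered_words = [word for word in sample_text.split(' ') if word not in sample_stopwords]

# Recount the remaining words and show the most frequent ones
filtered_counts = Counter(filtered_words)
print(filtered_counts.most_common(5))
```

For long stop word lists, storing them in a set instead of a list makes each "not in" check faster, without changing the result.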

A word cloud of Wikipedia's Earth page with common stop words removed.

Use a Predefined List of Stop Words

There is no fixed or optimal list of stop words in any language; the right list depends on the context of your text. For example, temperature numbers would be significant for a text about the weather, but less so for a text about songs or legal documents.

All major NLP libraries come with their own predefined set of stop words. You can find very extensive and thorough lists online.

  • You can find stop words in several languages in this GitHub repo.

  • You can find stop words compiled from multiple sources in this GitHub repo.

The wordcloud library comes with a predefined list of 192 stop words. Here are 20 words from that list:

> ['because', 'i', 'myself', "shouldn't", 'down', 'your', 'above', 'been', "i'm", 'again', 'that', 'during', 'being', 'was', 'before', "wasn't", 'ought', 'and', 'own', 'both']

You can also set your own list of stop words with the stopwords parameter WordCloud(stopwords = [...]), but by default, the wordcloud library will use its own list.
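
As a rough sketch of how the stopwords parameter might be used, the snippet below extends the library's default set with two extra words. The extra words are hypothetical choices for this example, and the import is guarded so the snippet still runs when wordcloud is not installed.

```python
# Hypothetical extra stop words chosen for the Earth page (an assumption)
custom_stopwords = {"earth", "earth's"}

try:
    # STOPWORDS is the set of default stop words shipped with wordcloud
    from wordcloud import STOPWORDS, WordCloud

    # Combine the default set with the custom words
    all_stopwords = set(STOPWORDS) | custom_stopwords
    # Pass the combined set through the stopwords parameter
    cloud = WordCloud(stopwords=all_stopwords)
except ImportError:
    # wordcloud is not installed: fall back to the custom words only
    all_stopwords = set(custom_stopwords)
    cloud = None

print(len(all_stopwords))
```

When the library is available, calling cloud.generate(text) on the downloaded text would then build the word cloud without any of the words in all_stopwords.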

Let's put this list into action now and see what happens.

The Wikipedia Earth page with default stop words removed.

Let's Recap!

  • Transforming a text into a list of words is called tokenization, and each word is a token.

  • A simple way to tokenize is to split the text over whitespaces with  text.split(' ')  .

  • There is no fixed or optimal list of stop words in each language. They really depend on your context.

  • By default, if you do not specify a  stopwords  parameter value, the  wordcloud  library uses its built-in list of 192 stop words.

In this chapter, you learned how to remove stop words from a text. You also learned simple tokenization by splitting the text over whitespaces to get a list of words. Tokenization? What now? Let’s explore what this means in the next chapter!
