
Build Your First Word Cloud


Hello and welcome to the course!

My name is Alexis Perrier (@alexip), and I am a data scientist and natural language processing (NLP) expert. I primarily use NLP to help research scientists analyze trends and behaviors across social media platforms.

I am excited to share some of the tools and methods I use on the job. By the end of this course, you will also be able to apply them to your text analysis projects!

Before getting into the technique, let's take a moment to discuss natural language processing as a whole.

What Is Natural Language Processing?

Language is everywhere. It's how I am communicating with you right now! There is even a whole scientific discipline, called linguistics, dedicated to the study of human language, including syntax, morphology, and phonetics, whose goal is to uncover the psychological and societal workings of language. Ambitious, right? ^^

Okay...but how does this relate to computer science?

Well, in the late 1950s, classic linguistics gave birth to computational linguistics by adding statistics and computers into the mix. The goal then became to build automated processes that could understand and transform text. 

Today, natural language processing (NLP) is a direct evolution of computational linguistics that leverages artificial intelligence and machine learning.

To deepen your understanding, let’s look at some real-world applications of NLP in everyday life. You will recognize a few!

Discover Real-Life Use Cases of NLP

Information Extraction

Information extraction means identifying a specific piece of information in a block of text, such as a name, a location, or a hashtag.

For example, in human resources, natural language processing can help match the right candidate to the right job. Imagine that you have a bank of thousands of CVs. That's thousands and thousands of words. The ability to extract a specific job title from all those words makes job matching a lot more efficient.
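As a quick illustration (not part of this chapter's setup), a library such as spaCy can pull names, organizations, dates, and locations out of a sentence. The model name  en_core_web_sm  below is an assumption; you would need to install spaCy and download that model separately:

import spacy

# assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

cv_excerpt = "Jane Doe worked as a data scientist at Acme Corp in Boston from 2018 to 2021."
doc = nlp(cv_excerpt)

# print each detected entity with its label (PERSON, ORG, GPE, DATE, ...)
for ent in doc.ents:
    print(ent.text, ent.label_)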

Speech-to-Text Conversion

Have you ever used the dictation tool on your phone or spoken to Siri or Alexa? NLP helps convert spoken word to numerical vectors, which computers can understand and process.

Text Classification

This type of NLP is typically referred to as text tagging: assigning a sentence or word to an appropriate category, such as positive or negative, or spam or not spam. It can help with spam filtering, sentiment analysis, topic inference, and hate speech detection!

For example, in digital marketing, a brand's social media account may have hundreds of comments a day. These comments bring valuable insights to the company but can be tedious to sift through. NLP can help identify positive or negative aspects of customers' comments!
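To give a taste of what this looks like in code, here is a toy sketch using scikit-learn (which is not part of this chapter's setup; the four training comments and their labels are made up purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# tiny, made-up training set of labeled comments
train_texts = [
    "great product, love it",
    "terrible, it broke after a day",
    "works exactly as expected",
    "awful customer service"
]
train_labels = ["positive", "negative", "positive", "negative"]

# turn each comment into word counts, then train a simple classifier
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)
classifier = MultinomialNB().fit(X, train_labels)

# classify a new comment
print(classifier.predict(vectorizer.transform(["love the customer service"])))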

Text Generation

Google’s Smart Compose, which helps write emails, is a good example. But text generation goes beyond email composition. Recent models (such as GPT-3) can write coherent stories across multiple paragraphs, paving the way for automated news content creation!

NLP is evolving at breathtaking speed. Recent models can write text indistinguishable from human prose, automatic translation has become commonplace, and conversational chatbots are increasingly efficient. This course will guide you through the foundations of creating an NLP model. Consider this the first step before going on to build the world’s next speech detection technology. ^^

Let’s kick things off with some hands-on practice. The goal is to give you a taste of what it’s like to build a natural language processing model. You will also learn about stop words, which will come in handy in the coming chapters!

Visualize Text With a Word Cloud

Let's start by creating a word cloud. You have probably seen them before, but if not, here's a pretty one I found:

A word cloud in the shape of a butterfly: the most frequent words appear larger, and less frequent words appear smaller. Source: https://olliconnects.org/what-are-word-clouds/

A word cloud is a snapshot of a text meant to help you explore and understand it at a glance. It is a word image where each word's size is proportional to its importance (more frequent words appear larger). A word cloud can be particularly useful during a professional presentation to get your crowd engaged and draw attention to the main themes! ^^
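Under the hood, that "importance" is simply a word count. As a quick illustration (this snippet is not part of the course code), Python's built-in collections.Counter computes exactly those frequencies:

from collections import Counter

sentence = "the earth orbits the sun and the moon orbits the earth"
# count how often each word appears; the most frequent words would be drawn largest
counts = Counter(sentence.split())
print(counts.most_common(3))   # [('the', 4), ('earth', 2), ('orbits', 2)]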

So, how do I make one?

Set Up Your Environment

This course works with the latest Python version (3.8 at the time of writing) and the Anaconda Python distribution. Check out the Anaconda site to learn how to install a Conda-managed Python environment on your local machine.

Install the Word Cloud Library

To generate word clouds in Python, use the wordcloud library. You can read more about it on its Python Package Index (PyPI) page. Install it by running:

  • In a terminal: pip install wordcloud  or  conda install -c conda-forge wordcloud

  • In a Google Colab notebook: !pip install wordcloud

Import the Text Material

For this example, we will use content from a Wikipedia page, which you can download as raw text in a few lines of code, without the page's markup. Copy-paste the code below to get started. We will reuse the  wikipedia_page()  function later in the course.

Since we're all earthlings, we will use the article on planet Earth, but I strongly encourage you to experiment with other topics and content as you follow along, such as Harry Potter, dogs,  Star Trek, or coffee.

import requests

def wikipedia_page(title):
    '''
    This function returns the raw text of a wikipedia page
    given a wikipedia page title
    '''
    params = {
        'action': 'query',
        'format': 'json',    # request json formatted content
        'titles': title,     # title of the wikipedia page
        'prop': 'extracts',
        'explaintext': True
    }
    # send a request to the wikipedia api
    response = requests.get(
        'https://en.wikipedia.org/w/api.php',
        params=params
    ).json()
    # parse the result
    page = next(iter(response['query']['pages'].values()))
    # return the page content
    if 'extract' in page.keys():
        return page['extract']
    else:
        return "Page not found"

# We lowercase the text to avoid having to deal with uppercase and capitalized words
text = wikipedia_page('Earth').lower()
print(text)

If Wikipedia isn't your thing, you can also go classic with Project Gutenberg. It is a library that holds thousands of free public domain texts that you can download in three lines of code:

import requests
# this is the url for Alice in Wonderland
result = requests.get('http://www.gutenberg.org/files/11/11-0.txt')
print(result.text)

Create the Word Cloud

Let's create a word cloud. Remember, the goal is to understand better what Wikipedia's Earth page is about without reading it.

# import the wordcloud library
from wordcloud import WordCloud

# instantiate a new wordcloud
wordcloud = WordCloud(
    random_state=8,
    normalize_plurals=False,
    width=600,
    height=300,
    max_words=300,
    stopwords=[]
)

# apply the wordcloud to the text
wordcloud.generate(text)

Then use  matplotlib  to display the word cloud as an image:

import matplotlib.pyplot as plt

# create a figure
fig, ax = plt.subplots(1, 1, figsize=(9, 6))
# add interpolation='bilinear' to smooth things out
plt.imshow(wordcloud, interpolation='bilinear')
# and remove the axis
plt.axis("off")
# display the figure (needed when running outside a notebook)
plt.show()

Ta-da! Your first word cloud! :-°

A word cloud of the Wikipedia page for Earth

What strikes you about this word cloud?

The word cloud tells you that the page is about the Earth, the Moon, and the Sun. Cool, well, you knew that already.

However, notice that most of the words do not give you any information about the text's content. There are a lot of common words that are not very meaningful, such as the, of, is, are, and, that, from, etc.

Stop words are words that do not provide any useful information to infer a text's content or nature. This may be either because they aren't meaningful (prepositions, conjunctions, etc.) or because they are too frequent.
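To make the idea concrete, here is a minimal sketch that filters out a small, hand-picked list of stop words from the  text  variable defined above. It is only an illustration; the next chapter shows a more robust approach with a proper stop word list:

# a tiny, hand-picked stop word list, just for illustration
stop_words = {'the', 'of', 'is', 'are', 'and', 'that', 'from', 'a', 'in', 'to'}

# keep only the words that are not in the stop word list
words = text.split()
filtered = [word for word in words if word not in stop_words]
print(filtered[:20])   # the first 20 remaining words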

Eliminating stop words is the first step when you are preprocessing raw text. You will see how to remove them in the next chapter! Then we will come back to this word cloud and redo it.

Let's Recap!

  • Natural language processing (NLP) lies at the intersection between linguistics and computer science. Its purpose is to transform (process) human language (natural language) into language a machine can understand and use.

  • Some common NLP use-cases include information extraction, text classification, unsupervised exploration, text generation, and many others.

  • A word cloud is a snapshot of a text. It helps you explore and understand text at a glance.

  • To generate word clouds in Python, use the wordcloud library. You can read more about it on its project page on the Python Package Index (PyPI).

  • Stop words do not provide any useful information to infer a text’s content. They are either prepositions and conjunctions (e.g., the, of, is, are, and, that, from), or non-specific words that occur too frequently.

  • All the scripts in the course are available as Jupyter Notebooks in this GitHub repo.

In the next chapter, we'll start the pre-processing phase of any NLP project by removing stop words.
