So far, you've seen clean text. But in the wild world of online texts and social media, you have to deal with a lot more noise: HTML tags when you scrape a page, emojis in tweets, accents in French, URLs, and emails everywhere, among others. You may need to get rid of these!
Let's start with something simple yet useful, extracting #hashtags from a social media corpus (tweets, Instagram, etc.).
Imagine that you want to extract all of the #hashtags from a collection of tweets. More precisely, you want to find all the strings that start with the # sign and sit between word boundaries such as spaces, tabs, line returns, etc.
To do that, you use re, Python's built-in regex library. It enables you to:
Define a string pattern. The pattern can be more or less complex but always precise.
Operate on the strings that match the pattern: search, extract, replace.
The code to find all hashtags from a piece of text goes like this:
# the source text
text = ' _ _ _ _ _ ... '
# 1. import the regex library
import re
# 2. define the pattern
pattern = r'#\S+'
# 3. find all the strings that match the pattern with the findall method
re.findall(pattern, text)
We'll come back to the definition of the pattern r'#\S+' later. For now, let's apply that code to a collection of three tweets:
An #autumn scene showing a beautiful #horse coming to visit me.
My new favourite eatery in #liverpool and I mean superb! #TheBrunchClub #breakfast #food
#nowplaying Pointer Sisters - Dare Me | #80s #disco #funk #radio
# the corpus of tweets
tweets = [
    'An #autumn scene showing a beautiful #horse coming to visit me.',
    'My new favourite eatery in #liverpool and I mean superb! #TheBrunchClub #breakfast #food',
    '#nowplaying Pointer Sisters - Dare Me | #80s #disco #funk #radio',
]
# and the hashtag extraction
import re
pattern = r'#\S+'
for text in tweets:
    print(re.findall(pattern, text))
This results in:
['#autumn', '#horse']
['#liverpool', '#TheBrunchClub', '#breakfast', '#food']
['#nowplaying', '#80s', '#disco', '#funk', '#radio']
It worked! We extracted all the hashtags from the tweets!
Usernames, which start with an @ sign, can be extracted in just the same way. You only have to replace the # sign with the @ sign in the definition of the regex.
import re
text = 'Check out this new NLP course on @openclassrooms by @alexip'
# change the pattern: # -> @
pattern = r'@\S+'
print(re.findall(pattern, text))
That was pretty easy. After all, the pattern only had to recognize the first character of the string.
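One caveat is worth checking before relying on these patterns: \S+ keeps matching until the next whitespace, so punctuation glued to a hashtag comes along with it. The sketch below compares r'#\S+' with a stricter variant, r'#\w+', on a made-up tweet. The stricter pattern and the sample text are assumptions for illustration, not part of the original example.

```python
import re

# hypothetical tweet with punctuation right after the hashtags
tweet = 'Best brunch in #liverpool! Loved the #food.'

# \S+ matches every non-whitespace character, punctuation included
loose = re.findall(r'#\S+', tweet)
# \w+ only matches letters, digits and underscores
strict = re.findall(r'#\w+', tweet)

print(loose)   # ['#liverpool!', '#food.']
print(strict)  # ['#liverpool', '#food']
```

Which variant is appropriate depends on the corpus: \w+ drops trailing punctuation but would also cut a hashtag at a hyphen.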
Identify a Regex
So what is regex?
You've seen two regex definitions so far: r'#\S+' for #hashtags and r'@\S+' for @usernames, used to extract the hashtags and usernames from tweets.
Once you've defined the pattern, use it to transform the text. The Python re library includes the following three main functions:
re.findall(pattern, text), which returns the list of strings that match the pattern.
re.sub(pattern, replace_with, text), which replaces string sequences that match the pattern with the replace_with sequence.
re.search(pattern, text), which returns the first match as a match object, with information about the starting and ending position of the match (or None if nothing matches).
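As a quick illustration of the three functions, here is a minimal sketch that reuses the username pattern on the sample sentence from earlier. The replacement string '@user' is an arbitrary choice made for this example.

```python
import re

text = 'Check out this new NLP course on @openclassrooms by @alexip'
pattern = r'@\S+'

# findall: list of every matching string
found = re.findall(pattern, text)
# sub: new string where every match is replaced
masked = re.sub(pattern, '@user', text)
# search: match object for the first match only (None if no match)
first = re.search(pattern, text)

print(found)          # ['@openclassrooms', '@alexip']
print(masked)         # Check out this new NLP course on @user by @user
print(first.group())  # @openclassrooms
print(first.span())   # (33, 48)
```

Note that re.search stops at the first hit; use re.findall or re.finditer when every occurrence is needed.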
Let's apply the re.sub function to remove all the HTML tags from an HTML page.
Remove HTML Tags
Say you have downloaded a web page, and you want to pull out the text from the page without all the HTML markup. You can use regex for that by defining a pattern that finds all the strings contained between a < and a >:
import requests
import re

# Music is in the House!
url = 'https://en.wikipedia.org/wiki/House_music'
# GET the content
# Note: requests.get().content returns a bytes object
# that we can decode into a string with .decode('UTF-8')
html = requests.get(url).content.decode('UTF-8')
# remove the header part of the html:
# split() returns a list, so keep the part that follows </head>
html = html.split('</head>')[1]
# and remove all the html tags
text = re.sub(r'<[^>]*>', ' ', html)
For instance, you may get:
Cultural origins 1980s, Chicago , Illinois , United States Derivative forms Electroclash Eurobeat techno UK garage speed garage trance dance-pop 2-step garage Detroit techno Subgenres Acid house deep
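To try the tag-removal pattern without a network call, here is a small sketch on a hand-written HTML snippet (the snippet is made up for this example). It also collapses the runs of spaces left behind by the removed tags, an extra step that is an assumption on my part and not part of the code above.

```python
import re

# hypothetical HTML fragment standing in for a downloaded page
html = '<h1>House music</h1><p>Origins: <b>Chicago</b>, 1980s</p>'

# replace every <...> tag with a space
text = re.sub(r'<[^>]*>', ' ', html)
# collapse consecutive whitespace and trim both ends
clean = re.sub(r'\s+', ' ', text).strip()

print(clean)  # House music Origins: Chicago , 1980s
```

The stray space before the comma mirrors the sample output above: each tag becomes a space, even when it sits right next to punctuation.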
For a final example, let's extract the URLs in a text using the Wikipedia raw HTML page as content.
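URLs all start with the name of the protocol (FTP, HTTP, etc.). We'll stick to the standard full web URLs that start with http.
Could we simply replace the # in the hashtag pattern with http, like earlier? Not quite: in raw HTML, r'http\S+' would also capture the closing quote and any markup that follows the URL without a space.
To find URLs, use this slightly more complex pattern: r'http.+?(?="|<)'. It matches http followed by as few characters as possible, stopping just before the next double quote or < sign.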
Let's test it on some HTML content.
import requests, re
url = 'https://en.wikipedia.org/wiki/House_music'
# GET, decode and drop header (keep the part that follows </head>)
html = requests.get(url).content.decode('UTF-8').split('</head>')[1]
# find all the urls
pattern = r'http.+?(?="|<)'
urls = re.findall(pattern, html)
This returns a list of all the 279 URLs contained in the Wikipedia page for House Music.
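The effect of the lookahead is easier to see on a short string. The sketch below runs the URL pattern and the naive r'http\S+' on a made-up HTML fragment; the fragment and the example.org / example.com addresses are placeholders.

```python
import re

# hypothetical HTML fragment with two URLs
html = '<a href="https://example.org/a">A</a> see http://example.com/b<br>'

# lazy match that stops just before the next " or <
urls = re.findall(r'http.+?(?="|<)', html)
# naive pattern: runs on until the next whitespace
naive = re.findall(r'http\S+', html)

print(urls)   # ['https://example.org/a', 'http://example.com/b']
print(naive)  # ['https://example.org/a">A</a>', 'http://example.com/b<br>']
```

The lookahead (?="|<) checks for the delimiter without consuming it, so the quote or < never ends up in the result.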
You could find or replace all sorts of elements (emails, punctuation signs, numbers, zip codes, phone numbers, etc.) with different patterns.
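As a rough illustration, here are a few such patterns applied to a made-up sentence. These are deliberately simplified sketches: the email pattern, the five-digit zip code format, and the sample text are all assumptions, and real-world data usually needs stricter patterns.

```python
import re

# hypothetical text containing an email, a phone number and a zip code
text = 'Contact jane.doe@example.com or call 555-0199; office zip 60601.'

# simplified email pattern: name @ domain . suffix
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text)
# every run of digits
numbers = re.findall(r'\d+', text)
# exactly five digits between word boundaries (US-style zip code)
zips = re.findall(r'\b\d{5}\b', text)

print(emails)   # ['jane.doe@example.com']
print(numbers)  # ['555', '0199', '60601']
print(zips)     # ['60601']
```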
A Recap of the Main Regex Patterns
Here's a table of useful regex patterns:
list of words
word01, word02, word03...
Build Regex Patterns
As you've probably noticed, we've steered clear of explaining how to create the regex. There are a couple of reasons for that.
First of all, regex can seem intimidating, and it's easier to start with some out-of-the-box examples. Secondly, building a proper regex pattern can be time-consuming with lots of trial and error. In practice, I usually google what I'm looking for, end up on Stack Overflow, and quickly find the simplest regex available. I then test the regex on some examples, avoiding the task of having to create my own.
It's still handy to know more about the inner workings of a regex pattern!
[]: a set of characters.
a-z: lowercase letters, A-Z: uppercase letters, À-ÖØ-öø-ÿ: accented letters (used inside [], as in [a-z]).
\d: digits. Equivalent to [0-9] for ASCII text.
\S any character that is not a whitespace character.
\w word characters, including numbers and the underscore.
\s space characters including line returns, tabs, non-breaking space, etc.
+: 1 or more repetitions.
?: 0 or 1 repetition.
*: 0 or more.
\b: empty string, but only at the beginning or end of a word, so a potential word tokenizer can be r'\b\w+\b'.
^: from the start of the text.
$: until the end of the text.
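The building blocks above can be tried out one at a time. This is a minimal sketch on a made-up sentence, with variable names chosen for this example only.

```python
import re

text = 'Hello World, it is 2024'

# [a-z]+ : runs of lowercase letters only (capitals are skipped)
lower_runs = re.findall(r'[a-z]+', text)
# \d+ : one or more digits
digits = re.findall(r'\d+', text)
# \S+ : runs of non-whitespace characters (punctuation included)
chunks = re.findall(r'\S+', text)
# \b\w+\b : whole words, delimited by word boundaries
words = re.findall(r'\b\w+\b', text)
# ? : the preceding character is optional
colors = re.findall(r'colou?r', 'color colour')
# ^ and $ : anchor the match to the start or end of the text
starts_with_hello = bool(re.search(r'^Hello', text))
ends_with_year = bool(re.search(r'\d+$', text))

print(lower_runs)  # ['ello', 'orld', 'it', 'is']
print(digits)      # ['2024']
print(chunks)      # ['Hello', 'World,', 'it', 'is', '2024']
print(words)       # ['Hello', 'World', 'it', 'is', '2024']
print(colors)      # ['color', 'colour']
print(starts_with_hello, ends_with_year)  # True True
```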
There are a number of websites dedicated to building and testing regex patterns. Regex 101 is a good example.
Precompile a Regex Pattern
You've seen cases where the regex is defined as a string and passed directly to functions such as re.findall and re.sub.
However, it's also possible to precompile the regex pattern with re.compile:
import re
pattern = re.compile(r'@\S+')
re.findall(pattern, text)
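Precompiling the regex saves the work of parsing the pattern on every call. Python's re module already caches recently used patterns, so the gain is often modest, but precompiling remains good practice when the same pattern is applied to large volumes of data.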
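A compiled pattern is an object with its own findall, sub, and search methods, so it can be reused across a whole corpus. Here is a minimal sketch that reuses two of the tweets from earlier; the variable names are illustrative.

```python
import re

tweets = [
    'An #autumn scene showing a beautiful #horse coming to visit me.',
    '#nowplaying Pointer Sisters - Dare Me | #80s #disco #funk #radio',
]

# compile once, outside the loop
hashtag = re.compile(r'#\S+')

# call the method on the compiled pattern for every tweet
all_tags = [tag for tweet in tweets for tag in hashtag.findall(tweet)]

print(all_tags)
# ['#autumn', '#horse', '#nowplaying', '#80s', '#disco', '#funk', '#radio']
```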
Regex is a powerful tool used across many languages (C, PHP, Java, Go, Julia, Haskell, R, and more) and even on the command line. Although there are some variations between regex flavors, they share most of the core pattern definitions.
On the command line, many standard commands support regex. For example, to extract all emails from a text file, grep the email pattern to get the list. Note that the pattern is written without the Python r'' prefix:
> grep -oE '\S*@\S*\s?' files.txt
A regex is a sequence of characters that defines a search pattern that can match, locate, and manage text.
You can use pre-defined regex to extract simple text elements, such as usernames or hashtags. In this chapter, you learned some of the most common patterns and how to use them to extract information from a text.
You can also use regex to clean up the text by removing unwanted tags and more complex elements.
Regex is fast and can be used from the command line and in most programming languages.
This concludes Part I of the course. In the next part, you'll see how to transform text into numbers in order to use machine learning!