For this extra chapter, we go back to basics and look at words and their spelling patterns.
So far, you’ve seen clean text. But in the world of online texts and social media, you have to deal with a lot more noise: HTML tags when you scrape a page, emojis in tweets, accents in French and Spanish, URLs, and emails, among others. You may need to get rid of these.
The goal of this chapter is to use specific spelling patterns to identify words that adhere to that pattern. Examples of patterns would be:
A hashtag: a word that starts with a # (#abcdef).
An email: a string that includes the @ sign and some characters before and after (abcdef@ghij).
An HTML tag: a word from a list (a, p, div) preceded by < and followed by > (<div id=''>).
Defining spelling patterns and finding matching words is done with regular expressions, or regex for short. Most programming languages ship with a regex module (and it’s blazingly fast).
So what is regex? 🤔
In Python, you use the regex library, re, to:
Define string patterns.
Operate on the strings that match the pattern: search, extract, replace, remove.
Let’s start with something simple yet useful, extracting #hashtags from a social media corpus (tweets, Instagram, etc.).
Extract Hashtags
Hashtags are prevalent on many social media platforms (Instagram, TikTok, Mastodon, etc.). Imagine that you want to extract all #hashtags from a collection of posts. More precisely, you want to find all the strings that start with the # sign and run up to the next whitespace character (space, tab, line return, etc.).
The code to find all hashtags from a piece of text goes like this:
# the source text
text = ' _ _ _ _ _ ... '
# 1. import the regex library
import re
# 2. define the pattern
pattern = r'#\S+'
# 3. find all the strings that match the pattern with the findall method
re.findall(pattern, text)
We’ll come back to the definition of the pattern r'#\S+' later. For now, let’s apply that code to a collection of three posts:
An #autumn scene showing a beautiful #horse coming to visit me.
My new favorite eatery in #liverpool! And I mean superb! #TheBrunchClub #breakfast #food.
#nowplaying Pointer Sisters - Dare Me | #80s #disco #funk #radio.
# the corpus of tweets
posts = [
'An #autumn scene showing a beautiful #horse coming to visit me.',
'My new favourite eatery in #liverpool And I mean superb! #TheBrunchClub #breakfast #food',
'#nowplaying Pointer Sisters - Dare Me | #80s #disco #funk #radio']
# and the hashtag extraction
import re
pattern = r'#\S+'
for text in posts:
    print(re.findall(pattern, text))
This results in:
['#autumn', '#horse']
['#liverpool', '#TheBrunchClub', '#breakfast', '#food']
['#nowplaying', '#80s', '#disco', '#funk', '#radio']
It worked! We extracted all the hashtags from the posts!
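One caveat is worth checking: \S matches any non-whitespace character, so punctuation glued to a hashtag ends up in the match. The sketch below compares r'#\S+' with a stricter r'#\w+' on a post that keeps its punctuation. The stricter pattern assumes hashtags contain only word characters, which is a simplification for illustration rather than a rule taken from any platform.

```python
import re

# a post where punctuation directly follows some hashtags
post = 'My new favorite eatery in #liverpool! #TheBrunchClub #breakfast #food.'

# \S+ keeps everything up to the next whitespace, punctuation included
loose = re.findall(r'#\S+', post)

# \w+ stops at the first character that is not a letter, digit or underscore
# (assumption: hashtags contain only word characters)
strict = re.findall(r'#\w+', post)

print(loose)   # ['#liverpool!', '#TheBrunchClub', '#breakfast', '#food.']
print(strict)  # ['#liverpool', '#TheBrunchClub', '#breakfast', '#food']
```

Which of the two is preferable depends on how clean the posts are and on what counts as a hashtag in your corpus.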
Extract @Usernames
You can extract usernames that start with a @ sign the same way. You only have to replace the # sign with the @ sign in the regex definition.
import re
text = 'Check out this new NLP course on @openclassrooms by @alexip'
# change the pattern # -> @
pattern = r'@\S+'
print(re.findall(pattern, text))
You get:
['@openclassrooms', '@alexip']
That was pretty easy. After all, the pattern only had to recognize the string’s first character.
So far, you’ve seen two regex definitions: r'#\S+' for #hashtags and r'@\S+' for @usernames, used to extract the hashtags and usernames from posts.
The Python re library includes the following three main functions to extract specific strings or modify a text:
re.findall(pattern, text) returns the list of strings matching the pattern.
re.sub(pattern, replace_with, text) replaces string sequences that match the pattern with the replace_with sequence.
re.search(pattern, text) returns the first match only, as a match object with information about the starting and ending position of the match (or None if nothing matches).
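To see the three functions side by side, here is a small sketch on a made-up post (the text and the '@user' placeholder are illustrative choices). Note that re.search stops at the first match and returns a match object rather than a string.

```python
import re

text = 'Thanks @alexip for the #nlp course on @openclassrooms'
pattern = r'@\S+'

# findall: every match, as a list of strings
usernames = re.findall(pattern, text)

# sub: replace every match with another string (here, a placeholder)
anonymized = re.sub(pattern, '@user', text)

# search: the first match only, as a match object (None if nothing matches)
match = re.search(pattern, text)
first = match.group()  # the matched string
span = match.span()    # (start, end) positions in the text

print(usernames)    # ['@alexip', '@openclassrooms']
print(anonymized)   # Thanks @user for the #nlp course on @user
print(first, span)  # @alexip (7, 14)
```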
Let’s apply the re.sub function to remove all the HTML tags from an HTML page.
Remove HTML Tags
Say you have downloaded a web page and want to pull out the text without all the HTML markup. You can use regex by defining a pattern that finds all the strings contained between a < and a >: r'<[^>]*>'.
import requests
import re
# Music is in the House!
url = 'https://en.wikipedia.org/wiki/House_music'
# GET the content
# Note: requests.get().content returns a byte object
# that we can cast as string with .decode('UTF-8')
html = requests.get(url).content.decode('UTF-8')
# remove the header part of the html
html = html.split('</head>')[1]
# and remove all the html tags
text = re.sub(r'<[^>]*>', ' ', html)
For instance, print(text[2009:2200]) may return:
characterized by a repetitive four-on-the-floor beat and a typical tempo of 120 beats per minute. [10] It was created by DJs and music producers from Chicago 's underground
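The same step can be tried without a network call. The sketch below runs on a short, made-up HTML fragment (not taken from the Wikipedia page). Because each tag is replaced with a space, stray spaces remain, like the one in "Chicago 's" above. A second re.sub collapses runs of whitespace, although a space before punctuation can still survive.

```python
import re

# a small, made-up HTML fragment used only for illustration
html = '<p>House is a genre of <a href="/wiki/EDM">electronic dance music</a>.</p>'

# replace every tag with a space
text = re.sub(r'<[^>]*>', ' ', html)

# collapse the runs of whitespace left behind by the removed tags
clean = re.sub(r'\s+', ' ', text).strip()

print(clean)  # House is a genre of electronic dance music .
```

For pages with scripts, styles, or comments, a dedicated HTML parser is likely to be more robust than a single regex.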
With a slight modification, you could also use the above pattern, r'<[^>]*>', to remove:
LaTeX inline equations that start and end with $: r'\$[^$]*\$' (you need to escape the $ sign by adding a backslash \ before it: \$).
Text between brackets such as [music] or [clapping], which can sometimes be found in captions: r'\[[^\]]*\]'.
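As a quick check of these two variations, here is a sketch on a made-up caption. It puts the closing delimiter itself in the negated set ([^$] and [^\]]), so that each match stops at the nearest closing delimiter instead of running on to a later one.

```python
import re

# a made-up caption mixing inline LaTeX and bracketed sound cues
caption = 'The loss $L = x^2$ drops [applause] once $x$ shrinks [music].'

# inline LaTeX: a $, then anything that is not a $, then a closing $
no_latex = re.sub(r'\$[^$]*\$', '', caption)

# bracketed text: a [, then anything that is not a ], then a closing ]
no_brackets = re.sub(r'\[[^\]]*\]', '', caption)

print(no_latex)     # The loss  drops [applause] once  shrinks [music].
print(no_brackets)  # The loss $L = x^2$ drops  once $x$ shrinks .
```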
Extract URLs
For a final example, let’s extract the URLs in a text using the Wikipedia raw HTML page as content.
Could we re-use the pattern we defined for hashtags by simply replacing the # sign with http, r'http\S+'?
Unfortunately, that would not work quite as well. Hashtags (and @usernames) are usually followed by spaces, tabs, or line returns. In a raw HTML page, a URL more often ends with a double quote " or with the < that opens the next tag. You need to specify the pattern so that it also knows to look for ending characters (", <), not just starting ones (#, @, http).
You use this slightly more complex pattern to find URLs: r'http.+?(?="|<)'.
Let’s test it on some HTML content.
import requests, re
url = 'https://en.wikipedia.org/wiki/House_music'
# GET, decode and drop header
html = requests.get(url).content.decode('UTF-8').split('</head>')[1]
# find all the urls
pattern = r'http.+?(?="|<)'
urls = re.findall(pattern, html)
This returns a list of all the 279 URLs contained in the Wikipedia page for House Music. For instance, urls[100]:
'https://www.theguardian.com/music/2010/apr/10/charanjit-singh-acid-house'
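To see why the ending characters matter, the following sketch compares the two patterns on a short, made-up HTML fragment (the example.com and example.org addresses are placeholders). The lazy .+? takes as few characters as possible, and the lookahead (?="|<) requires a " or a < to come next without including it in the match.

```python
import re

# a made-up fragment: one URL inside an attribute, one in plain text
html = '<a href="https://example.com/a">first</a> see https://example.org/b<br>'

# stops only at whitespace, so markup is dragged along
to_whitespace = re.findall(r'http\S+', html)

# stops right before the first " or <
bounded = re.findall(r'http.+?(?="|<)', html)

print(to_whitespace)  # ['https://example.com/a">first</a>', 'https://example.org/b<br>']
print(bounded)        # ['https://example.com/a', 'https://example.org/b']
```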
You could find or replace all sorts of elements (emails, punctuation signs, numbers, zip codes, phone numbers, etc.) with different patterns.
A Recap of the Main Regex Patterns
Here’s a table of useful regex patterns:
target element | string pattern | regex |
#hashtags | #------ | r'#\S+' |
@usernames | @----- | r'@\S+' |
emails | ---@--- | r'\S*@\S*\s?' |
urls | http---- | r'http.+?(?="|<)' |
list of words | word01, word02, word03... | |
punctuation | ,.:;'"[]{} | |
digits | 01234567890 | r'\d+' |
html tags | <---> | r'<[^>]*>' |
inline latex | $---$ | r'\$[^$]*\$' |
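For the list of words, punctuation, and digits rows, one possible set of patterns is sketched below. These are assumptions for illustration rather than the course's own patterns: an alternation between word boundaries for a word list, a character set built from Python's string.punctuation (ASCII punctuation only), and \d+ for runs of digits. The sample sentence and the word list are made up.

```python
import re
import string

text = 'Disco, funk and house: 3 rooms, 120 BPM (all night)!'

# list of words: alternatives separated by |, between word boundaries
words = re.findall(r'\b(?:disco|funk|techno)\b', text, flags=re.IGNORECASE)

# punctuation: a character set built from the ASCII punctuation characters;
# re.escape protects the ones that have a special meaning in a regex
punct_pattern = '[' + re.escape(string.punctuation) + ']'
no_punct = re.sub(punct_pattern, '', text)

# digits: runs of one or more digits
numbers = re.findall(r'\d+', text)

print(words)     # ['Disco', 'funk']
print(no_punct)  # Disco funk and house 3 rooms 120 BPM all night
print(numbers)   # ['3', '120']
```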
Build Regex Patterns
As you’ve probably noticed, we’ve mostly stayed clear of explaining how to create the regex. There are a couple of reasons for that.
First, regex can seem intimidating, and it’s easier to start with some straightforward examples. Secondly, building a proper regex pattern can be time-consuming with lots of trial and error. In practice, I usually google what I’m looking for, end up on Stack Overflow, and quickly find the most simple regex available. I then test the regex on some examples, avoiding having to create my own.
It’s still handy to know more about the inner workings of a regex pattern!
Components:
[]: a set of characters.
a-z: lowercase letters, A-Z uppercase letters, or À-ÖØ-öø-ÿ for accented letters.
\d: a digit. Equivalent to [0-9] for ASCII text.
\S: any character that is not a whitespace character.
\w: word characters, including digits and the underscore.
\s: whitespace characters, including line returns, tabs, non-breaking spaces, etc.
Repetition:
+: 1 or more repetitions.
?: 0 or 1 repetition.
*: 0 or more.
Boundaries:
\b: empty string, but only at the beginning or end of a word, so a simple word tokenizer can be r'\b\w+\b'.
^: from the start of the text.
$: until the end of the text.
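The components above can be tried one by one. The sketch below applies each of them to a short, made-up sentence that contains a tab, so that \s+ has something other than spaces to match.

```python
import re

text = 'Dare Me was track 7 of 10\tin 1985'

# sets and ranges: a capital letter followed by lowercase letters
capitalized = re.findall(r'[A-Z][a-z]+', text)

# \d with + : runs of one or more digits
numbers = re.findall(r'\d+', text)

# \b and \w+ : a simple word tokenizer
tokens = re.findall(r'\b\w+\b', text)

# \s+ : split on any run of whitespace (the tab counts too)
pieces = re.split(r'\s+', text)

# ^ and $ : anchor the match to the start or the end of the text
first_word = re.findall(r'^\w+', text)
last_word = re.findall(r'\w+$', text)

# ? : the u is optional
spellings = re.findall(r'favou?rite', 'favorite favourite')

print(capitalized)            # ['Dare', 'Me']
print(numbers)                # ['7', '10', '1985']
print(tokens)                 # ['Dare', 'Me', 'was', 'track', '7', 'of', '10', 'in', '1985']
print(first_word, last_word)  # ['Dare'] ['1985']
print(spellings)              # ['favorite', 'favourite']
```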
There are several websites dedicated to building and testing regex patterns. Regex 101 is a good example.
As an example, let’s look at the first pattern we used to extract #hashtags, pattern = r'#\S+':
r: indicates that the string between '' is a raw string, the usual way to write a regex in Python (more on this below).
#\S+ the pattern that indicates we want to extract #hashtags. Find any strings that:
#: start with the # sign
\S: followed by any character that is not whitespace (spaces, tabs, etc.).
+: find strings consisting of one or more of the above (\S) characters.
Why are the regex patterns defined with a string preceded by the letter r?
You create a raw string by placing an r before the string. In a raw string, escape sequences such as the line return \n are not interpreted, which is why a raw string is preferred when declaring a regex pattern. To see the difference between a string and a raw string, compare print('\n') (prints a line return) and print(r'\n') (prints \n).
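The difference matters most for sequences that mean one thing to Python and another to the regex engine. A small sketch: \b is a word boundary in a regex, but in a normal Python string it is the backspace character, so the pattern only behaves as intended when written as a raw string. The sample text is made up.

```python
import re

# '\n' is one character (a line return); r'\n' is two (a backslash and an n)
print(len('\n'), len(r'\n'))  # 1 2

# without r, Python turns \b into a backspace before the regex engine sees it
plain = re.findall('\bhouse\b', 'acid house music')

# with r, the regex engine receives \b and treats it as a word boundary
raw = re.findall(r'\bhouse\b', 'acid house music')

print(plain)  # []
print(raw)    # ['house']
```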
Regex is a powerful tool used across many languages (C, PHP, Java, Go, Julia, Haskell, R, and more) and even on the command line. Although there are some variations between regex flavors, they share most of the core pattern definitions.
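On the command line, regex is built into many common commands, such as grep, sed, and awk. For example, to extract all emails from a text file, grep the email pattern to get the list. The -o flag prints only the matching part and -E enables extended syntax; \S and \s are GNU grep extensions, and the r prefix is specific to Python, so it is not used here.
> grep -oE '\S*@\S*\s?' my_file.txt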
Let’s Recap!
A regex is a sequence of characters defining a search pattern that matches, locates, and manages text.
You can use predefined regex to extract simple text elements, such as usernames, emails, or hashtags. In this chapter, you learned some of the most common patterns and how to use them to extract information from a text.
You can also use regex to clean up the text by removing unwanted tags and more complex elements.
Regex is blazingly fast and can be used from the command line and in most programming languages.
Okay, now it’s really done. Good luck with your future NLP adventures!