Last updated on 1/28/21

Extract Information With Regular Expressions

So far, you've seen clean text. But in the wild world of online texts and social media, you have to deal with a lot more noise: HTML tags when you scrape a page, emojis in tweets, accents in French, URLs, and emails everywhere, among others. You may need to get rid of these!

Let's start with something simple yet useful, extracting #hashtags from a social media corpus (tweets, Instagram, etc.).

Extract Hashtags

Imagine that you want to extract all of the #hashtags from a collection of tweets. More precisely, you want to find all the strings that start with the # sign and sit between word boundaries such as spaces, tabs, line returns, etc.

To do that, you use re, the regex library that ships with Python. It enables you to:

  • Define a string pattern. The pattern can be more or less complex but always precise.

  • Operate on the strings that match the pattern: search, extract, replace.

The code to find all hashtags from a piece of text goes like this:

# the source text
text = ' _ _ _ _ _ ... '
# 1. import the regex library
import re
# 2. define the pattern
pattern = r'#\S+'
# 3. find all the strings that match the pattern with the findall method
re.findall(pattern, text)

We'll come back to the definition of the pattern r'#\S+' later. For now, let's apply that code to a collection of three tweets:

  • An #autumn scene showing a beautiful #horse coming to visit me.

  • My new favourite eatery in #liverpool and I mean superb! #TheBrunchClub #breakfast #food

  • #nowplaying Pointer Sisters - Dare Me | #80s #disco #funk #radio

# the corpus of tweets
tweets = [
'An #autumn scene showing a beautiful #horse coming to visit me.',
'My new favourite eatery in #liverpool and I mean superb! #TheBrunchClub #breakfast #food',
'#nowplaying Pointer Sisters - Dare Me | #80s #disco #funk #radio']
# and the hashtag extraction
import re
pattern = r'#\S+'
for text in tweets:
    print(re.findall(pattern, text))

This results in:

['#autumn', '#horse']
['#liverpool', '#TheBrunchClub', '#breakfast', '#food']
['#nowplaying', '#80s', '#disco', '#funk', '#radio']

It worked! We extracted all the hashtags from the tweets!
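One caveat is worth a quick sketch. \S+ matches every non-whitespace character, so punctuation glued to a hashtag is captured along with it. Assuming hashtags contain only letters, digits, and underscores (a simplification), \w+ gives a stricter pattern. The tweet below is made up for illustration.

```python
import re

# hypothetical tweet with punctuation attached to the hashtags
tweet = 'Superb brunch in #liverpool! Loved the #food.'

# \S+ keeps any non-whitespace character, punctuation included
loose = re.findall(r'#\S+', tweet)

# \w+ keeps only letters, digits and underscores
strict = re.findall(r'#\w+', tweet)

print(loose)   # ['#liverpool!', '#food.']
print(strict)  # ['#liverpool', '#food']
```

Which pattern is preferable depends on how clean the corpus is; the looser one is fine when hashtags are always followed by a space.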

Extract @Usernames

Usernames, which start with an @ sign, can be extracted in just the same way. You only have to replace the # sign with the @ sign in the definition of the regex.

import re
text = 'Check out this new NLP course on @openclassrooms by @alexip'
# change the pattern # -> @
pattern = r'@\S+'
print(re.findall(pattern, text))

You get:

['@openclassrooms', '@alexip']

That was pretty easy. After all, the pattern only had to recognize the first character of the string.
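As a small variation, and only a sketch, both kinds of tokens can be pulled out in a single pass by accepting either sign as the first character. The sample text is invented, and the pattern assumes names made of letters, digits, and underscores.

```python
import re

# made-up text mixing hashtags and usernames
text = 'New #NLP course on @openclassrooms by @alexip #regex'

# [#@] accepts either sign as the first character
tokens = re.findall(r'[#@]\w+', text)
print(tokens)  # ['#NLP', '@openclassrooms', '@alexip', '#regex']
```

Note that a pattern anchored only on @ would also pick up the domain part of an email address, so emails are best handled before usernames.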

Identify a Regex

So what is regex? o_O

A regular expression, or regex, is a sequence of characters that defines a search pattern. You've seen two so far: r'#\S+' for #hashtags and r'@\S+' for @usernames.

Once you've defined the pattern, use it to search or transform the text. The Python re library includes the following three main functions:

  • re.findall(pattern, text), which returns the list of strings that match the pattern. 

  • re.sub(pattern, replace_with, text), which replaces string sequences that match the pattern with the replace_with sequence.

  • re.search(pattern, text), which returns the first match as a match object holding the start and end positions of the match, or None if nothing matches.
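Here is a minimal sketch of the three functions side by side. The sample sentence and the replacement string '@user' are illustrative choices, not part of the course material.

```python
import re

# made-up sentence for illustration
text = 'Ping @alice and @bob about #regex'
pattern = r'@\w+'

# findall: list of every matching string
found = re.findall(pattern, text)
print(found)     # ['@alice', '@bob']

# sub: replace each match with another string
replaced = re.sub(pattern, '@user', text)
print(replaced)  # Ping @user and @user about #regex

# search: match object for the first match, or None if there is none
match = re.search(pattern, text)
print(match.group(), match.start(), match.end())  # @alice 5 11
```

Because re.search returns None when nothing matches, it is safer to test the result before calling group() on it.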

Let's apply the re.sub function to remove all the HTML tags from an HTML page.

Remove HTML Tags

Say you have downloaded a web page, and you want to pull out the text from the page without all the HTML markup. You can use regex for that by defining a pattern that finds all the strings contained between a < and a >: r'<[^>]*>'.

import requests
import re
# Music is in the House!
url = 'https://en.wikipedia.org/wiki/House_music'
# GET the content
# Note: requests.get().content returns a byte object
# that we can cast as string with .decode('UTF-8')
html = requests.get(url).content.decode('UTF-8')
# remove the header part of the html
html = html.split('</head>')[1]
# and remove all the html tags
text = re.sub(r'<[^>]*>', ' ', html)

For instance, print(text[540:800]) may return something like:

Cultural origins  1980s,  Chicago ,  Illinois , United States    Derivative forms      Electroclash    Eurobeat    techno    UK garage    speed garage    trance    dance-pop    2-step garage    Detroit techno        Subgenres        Acid house    deep
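The same substitution can be tried offline on a small fragment. This is a sketch on a made-up piece of HTML; the second step, which collapses the leftover whitespace, is an addition and not part of the example above.

```python
import re

# hypothetical HTML fragment standing in for a downloaded page
html = '<p>House music began in <a href="/wiki/Chicago">Chicago</a> in the <b>1980s</b>.</p>'

# replace every tag with a space
text = re.sub(r'<[^>]*>', ' ', html)

# collapse the runs of whitespace left behind and trim the ends
clean = re.sub(r'\s+', ' ', text).strip()
print(clean)  # House music began in Chicago in the 1980s .
```

The stray space before the final period shows that this approach is rough. It also leaves script contents and HTML entities in place, so a real HTML parser is more robust when precision matters.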

Extract URLs

For a final example, let's extract the URLs in a text using the Wikipedia raw HTML page as content.

URLs all start with the name of the protocol (FTP, HTTP, etc.). We'll stick to the standard full web URLs that start with  http  .

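Could we simply swap the # of the hashtag pattern for http, as we did for usernames? Not quite. In raw HTML, a URL usually sits inside an attribute or between tags, so it is followed by a quote or a < rather than by whitespace, and r'http\S+' would swallow those trailing characters.

To find URLs, use this slightly more complex pattern: r'http.+?(?="|<)'. The .+? part matches as few characters as possible, and the lookahead (?="|<) stops the match just before the first " or < without including it.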

Let's test it on some HTML content.

import requests, re
url = 'https://en.wikipedia.org/wiki/House_music'
# GET, decode and drop header
html = requests.get(url).content.decode('UTF-8').split('</head>')[1]
# find all the urls
pattern = r'http.+?(?="|<)'
urls = re.findall(pattern, html)

This returns a list of all the URLs contained in the Wikipedia page for House music (279 when this was written; the count changes as the page is edited). For instance, urls[138] is:

'https://www.electronicbeats.net/juan-atkins-about-kraftwerk/'
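To see why the lookahead matters, here is a sketch comparing the URL pattern with a naive r'http\S+' on a small fragment. The HTML and the example.com / example.org addresses are placeholders.

```python
import re

# hypothetical HTML fragment; the URLs are placeholders
html = '<a href="https://example.com/a">A</a> see http://example.org/b<br>'

# non-whitespace match: runs past the closing quote into the markup
naive = re.findall(r'http\S+', html)
print(naive)
# ['https://example.com/a">A</a>', 'http://example.org/b<br>']

# lazy match that stops right before the first " or <
urls = re.findall(r'http.+?(?="|<)', html)
print(urls)
# ['https://example.com/a', 'http://example.org/b']
```

The pattern relies on every URL being followed by a quote or a tag, which holds for most HTML but not for plain text.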

You could find or replace all sorts of elements (emails, punctuation signs, numbers, zip codes, phone numbers, etc.) with different patterns.

A Recap of the Main Regex Patterns

Here's a table of useful regex patterns:

target element    string pattern               regex
#hashtags         #------                      r'#\S+'
@usernames        @-----                       r'@\S+'
emails            ---@---                      r'\S*@\S*\s?'
urls              http----                     r'http.+?(?="|<)'
list of words     word01, word02, word03...    r'word01|word02|word03'
punctuation       ,.:;'"[]{}                   r'[^A-Za-z0-9\s]'
digits            0123456789                   r'\d+'
html tags         <--->                        r'<[^>]*>'
inline latex      $---$                        r'\$[^$]*\$'
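As a quick check, here is a sketch that applies three of these patterns (emails, digits, and a list of words) to one made-up sentence. The words 'call' and 'thanks' stand in for word01 and word02.

```python
import re

# made-up sentence for illustration
text = 'Email jane@example.com or call 555 0199 before 5pm, thanks!'

# emails: non-whitespace around an @ sign, plus an optional trailing space
emails = re.findall(r'\S*@\S*\s?', text)
print(emails)  # ['jane@example.com ']

# digits: one or more consecutive digits
digits = re.findall(r'\d+', text)
print(digits)  # ['555', '0199', '5']

# list of words: alternatives separated by |
words = re.findall(r'call|thanks', text)
print(words)   # ['call', 'thanks']
```

The email match keeps its trailing space because of the final \s? in the pattern, so calling strip() on each result is a sensible follow-up.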

Build Regex Patterns

As you've probably noticed, we've steered clear of explaining how to create the regex. There are a couple of reasons for that.

First of all, regex can seem intimidating, and it's easier to start with some out-of-the-box examples. Secondly, building a proper regex pattern can be time-consuming, with lots of trial and error. In practice, I usually google what I'm looking for, end up on Stack Overflow, and quickly find the simplest regex available. I then test the regex on some examples, avoiding the task of having to create my own.

It's still handy to know more about the inner working of a regex pattern!

Components:
  • []: a set of characters.

  • a-z: lowercase letters; A-Z: uppercase letters; À-ÖØ-öø-ÿ: accented letters.

  • \d: digits. Equivalent to [0-9] for ASCII text.

  • \S: any character that is not a whitespace character.

  • \w: word characters, that is, letters, digits, and the underscore.

  • \s: whitespace characters, including line returns, tabs, non-breaking spaces, etc.

Repetition:
  • +: 1 or more repetitions.

  • ?: 0 or 1 repetition.

  • *: 0 or more.

Boundaries:
  • \b: empty string, but only at the beginning or end of a word, so a potential word tokenizer can be r'\b\w+\b'.

  • ^: matches at the start of the text.

  • $: matches at the end of the text.
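The building blocks above can be tried out on a short string. This is a sketch on a made-up sentence; it assumes Python 3 string patterns, where \w also matches accented letters.

```python
import re

# made-up sentence for illustration
text = 'Café de Flore opens at 7h30, see you there'

# \b\w+\b: runs of word characters between word boundaries (a simple tokenizer)
tokens = re.findall(r'\b\w+\b', text)
print(tokens)
# ['Café', 'de', 'Flore', 'opens', 'at', '7h30', 'see', 'you', 'there']

# a plain letter set misses the accent; adding À-ÖØ-öø-ÿ keeps it
print(re.findall(r'[A-Za-z]+', 'Café'))           # ['Caf']
print(re.findall(r'[A-Za-zÀ-ÖØ-öø-ÿ]+', 'Café'))  # ['Café']

# ?: the preceding element is optional (0 or 1 repetition)
print(re.findall(r'favou?rite', 'favorite favourite'))  # ['favorite', 'favourite']

# ^ anchors at the start of the text, $ at the end
starts = bool(re.search(r'^Café', text))
ends = bool(re.search(r'there$', text))
print(starts, ends)  # True True
```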

There are a number of websites dedicated to building and testing regex patterns. Regex 101 is a good example.

Precompile a Regex Pattern

You've seen cases where the regex is defined as a string and used in the functions sub and findall.

However, it's also possible to precompile the regex pattern with re.compile(string) :

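import re
text = 'Check out this new NLP course on @openclassrooms by @alexip'
# compile once, then call the methods of the compiled pattern
pattern = re.compile(r'@\S+')
pattern.findall(text)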

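Precompiling avoids looking up the pattern on every call. Python's re module already caches recently compiled patterns, so the speedup is usually modest, but it adds up when the same pattern is applied many times. It is good practice when dealing with large volumes of data.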
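A typical use is to compile the pattern once and reuse it across a whole corpus. The sketch below assumes a tiny two-tweet corpus; the variable name username and the '@user' placeholder are illustrative.

```python
import re

# compile once, outside the loop
username = re.compile(r'@\w+')

# tiny stand-in corpus
tweets = [
    'Check out this new NLP course on @openclassrooms by @alexip',
    'No mention here',
]

# the compiled pattern exposes findall, sub and search as methods
mentions = [username.findall(tweet) for tweet in tweets]
print(mentions)    # [['@openclassrooms', '@alexip'], []]

anonymized = [username.sub('@user', tweet) for tweet in tweets]
print(anonymized[0])  # Check out this new NLP course on @user by @user
```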

Regex is a powerful tool used across many languages (C, PHP, Java, Go, Julia, Haskell, R, and others) and even on the command line. Although there are some variations between regex flavors, they share the same core pattern definitions.

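On the command line, regex is supported by many standard tools such as grep, sed, and awk. For example, to extract all the emails from the text files in a directory, grep the email pattern to get the list. The -o flag prints only the matching part of each line and -E enables extended regex syntax. The \S and \s shorthands are GNU grep extensions, and the Python r prefix is dropped.

> grep -oE '\S*@\S*\s?' *.txt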

Let's Recap!

  • A regex is a sequence of characters that defines a search pattern used to match, locate, and manipulate text.

  • You can use pre-defined regex to extract simple text elements, such as usernames or hashtags. In this chapter, you learned some of the most common patterns and how to use them to extract information from a text.

  •  You can also use regex to clean up the text by removing unwanted tags and more complex elements.

  • Regex is fast and can be used both from the command line and in most programming languages.

This concludes Part I of the course. In the next part, you'll see how to transform text into numbers in order to use machine learning! 
