In a database, two different character strings can often refer to the same information.
For example, the character strings
‘Mary Louise’ refer to the same name, but the strings differ by a hyphen.
The LIKE Operator
To deal with such situations, SQL provides a very useful operator:
The LIKE operator allows you to search character strings. It is used when you know only part of a character string, or only its general form.
To search a string using
LIKE , two symbols can be used: the underscore sign
_ and the percentage sign
% . The underscore sign represents any single unknown character. The percentage sign is used to represent 0, 1 or any unknown sequence of characters.
For example, to find all of the companies whose names begin with the letter
A , we can write this:
SELECT * FROM entity WHERE name LIKE 'A%' ;
A Little Exercise!
In this table, you will find different expressions using
LIKE (left column), and the value returned by the expression in question (right column).
Upper and Lower Case
What does the expression
'OpenClassrooms' LIKE 'openclassrooms' mean?
Well, it depends! RDBMSes don’t all behave in the same way. Some will return
TRUE , others
So, to avoid error, I recommend this approach:
SELECT * FROM entity WHERE lower(name) LIKE 'a%' ;
This way, the name will be converted to lowercase, regardless of the value thanks to the
lower function. Once it is converted, write your pattern with no uppercase letters (here, the pattern is
This query will return all of the companies whose name begins with
A , regardless of case.
In the Panama Papers, a company name can be spelled in different ways.
For example, we found an intermediary named Pacher Banking S.A.
Let’s search to see if there are any duplicates:
SELECT * FROM intermediary WHERE lower(name) LIKE '%pacher%banking%' ;
There are duplicates! Two rows have a similar name. Do they correspond to the same company? Since they have the same address, we will assume here that they do (even though this is not entirely certain).
We can therefore adapt the query we saw in the previous chapter, adding the
LIKE operator, as follows: :
SELECTi.id AS intermediary_id,i.name AS intermediary_name,e.jurisdiction,e.jurisdiction_description,count(*) as cntFROMintermediary i,assoc_inter_entity a,entity eWHEREa.entity = e.id ANDa.inter = i.id AND(i.id = 5000 OR i.id = 5001)GROUP BYi.id, i.name, e.jurisdiction, e.jurisdiction_descriptionORDER BYcnt DESCLIMIT 5;
Go Further: Managing Duplicates in the Panama Papers
The designers of the Panama Papers database were aware that it contained many duplicates, but they were careful enough not to wrongly conflate two companies that had the same name. So they provided additional features to guide us.
They connected certain rows of the entity, intermediary, and officers tables using links such as "same name as" and "same address as".
You can find these links in the assoc_intermediaries, assoc_entities, and assoc_officers tables.
To search inexact character strings, use
Specify patterns using the underscore sign
_to replace any single unknown character, and the percentage sign
%to replace any unknown sequence of characters (or no character).
A Little Exercise!
The database is online. Get some practice by rewriting the queries found in this chapter. Don’t hesitate to change them, play with them, and “see what happens.”