In a database, two different character strings can often refer to the same information.
For example, the character strings ‘Mary-Louise'
and ‘Mary Louise’
refer to the same name, but the strings differ by a hyphen.
The LIKE Operator
To deal with such situations, SQL provides a very useful operator: LIKE
.
The LIKE operator allows you to search character strings. It is used when you know only part of a character string, or only its general form.
To search a string using LIKE
, two symbols can be used: the underscore sign _
and the percentage sign %
. The underscore sign represents any single unknown character. The percentage sign is used to represent 0, 1 or any unknown sequence of characters.
For example, to find all of the companies whose names begin with the letter A
, we can write this:
SELECT * FROM entity WHERE name LIKE 'A%' ;
A Little Exercise!
In this table, you will find different expressions using LIKE
(left column), and the value returned by the expression in question (right column).
Expression | Result |
| |
| |
| |
| |
|
|
| |
| |
| |
Upper and Lower Case
What does the expression'OpenClassrooms' LIKE 'openclassrooms'
mean?
Well, it depends! RDBMSes don’t all behave in the same way. Some will return TRUE
, others FALSE
.
So, to avoid error, I recommend this approach:
SELECT * FROM entity WHERE lower(name) LIKE 'a%' ;
This way, the name will be converted to lowercase, regardless of the value thanks to the lower
function. Once it is converted, write your pattern with no uppercase letters (here, the pattern is 'a%'
).
This query will return all of the companies whose name begins with A
, regardless of case.
Duplicates
In the Panama Papers, a company name can be spelled in different ways.
For example, we found an intermediary named Pacher Banking S.A.
Let’s search to see if there are any duplicates:
SELECT * FROM intermediary WHERE lower(name) LIKE '%pacher%banking%' ;
There are duplicates! Two rows have a similar name. Do they correspond to the same company? Since they have the same address, we will assume here that they do (even though this is not entirely certain).
We can therefore adapt the query we saw in the previous chapter, adding the LIKE
operator, as follows: :
SELECT
i.id AS intermediary_id,
i.name AS intermediary_name,
e.jurisdiction,
e.jurisdiction_description,
count(*) as cnt
FROM
intermediary i,
assoc_inter_entity a,
entity e
WHERE
a.entity = e.id AND
a.inter = i.id AND
(i.id = 5000 OR i.id = 5001)
GROUP BY
i.id, i.name, e.jurisdiction, e.jurisdiction_description
ORDER BY
cnt DESC
LIMIT 5;
Go Further: Managing Duplicates in the Panama Papers
The designers of the Panama Papers database were aware that it contained many duplicates, but they were careful enough not to wrongly conflate two companies that had the same name. So they provided additional features to guide us.
They connected certain rows of the entity, intermediary, and officers tables using links such as "same name as" and "same address as".
You can find these links in the assoc_intermediaries, assoc_entities, and assoc_officers tables.
Summary
To search inexact character strings, use
LIKE
.Specify patterns using the underscore sign
_
to replace any single unknown character, and the percentage sign%
to replace any unknown sequence of characters (or no character).