• 20 hours
  • Medium

Free online content available in this course.

course.header.alt.is_video

course.header.alt.is_certifying

Got it!

Last updated on 8/29/24

Search character strings using LIKE

In a database, two different character strings can often refer to the same information.

For example, the character strings  ‘Mary-Louise'  and  ‘Mary Louise’  refer to the same name, but the strings differ by a hyphen.

The LIKE Operator

To deal with such situations, SQL provides a very useful operator:  LIKE  .

The LIKE operator allows you to search character strings. It is used when you know only part of a character string, or only its general form.

To search a string using  LIKE  , two symbols can be used: the underscore sign  _  and the percentage sign  %  . The underscore sign represents any single unknown character. The percentage sign is used to represent 0, 1 or any unknown sequence of characters.

For example, to find all of the companies whose names begin with the letter  A  , we can write this:

SELECT * FROM entity WHERE name LIKE 'A%' ;

A Little Exercise!

In this table, you will find different expressions using  LIKE  (left column), and the value returned by the expression in question (right column).

Expression

Result

'OpenClassrooms' LIKE '%Class%' 

 TRUE

'OpenClassrooms' LIKE '%Class%ms'

 TRUE

'OpenClassrooms' LIKE '%Class%ms%'

 TRUE

'OpenClassrooms' LIKE 'Open_lassrooms'

 TRUE

'OpenClassrooms' LIKE 'Open__lassrooms'

FALSE

'OpenClassrooms' LIKE '_OpenClassrooms'

 FALSE

'OpenClassrooms' LIKE 'Op__Cla%'

 TRUE

'OpenClassrooms' LIKE '%OpenClas%srooms%'

 TRUE

Upper and Lower Case

What does the expression'OpenClassrooms' LIKE 'openclassrooms'  mean?

Well, it depends! RDBMSes don’t all behave in the same way. Some will return  TRUE  , others  FALSE  .:'( 

So, to avoid error, I recommend this approach:

SELECT * FROM entity WHERE lower(name) LIKE 'a%' ;

This way, the name will be converted to lowercase, regardless of the value thanks to the  lower  function. Once it is converted, write your pattern with no uppercase letters (here, the pattern is  'a%'  ).

This query will return all of the companies whose name begins with  A  , regardless of case.

Duplicates

In the Panama Papers, a company name can be spelled in different ways.

For example, we found an intermediary named Pacher Banking S.A.

Let’s search to see if there are any duplicates:

SELECT * FROM intermediary WHERE lower(name) LIKE '%pacher%banking%' ;

There are duplicates! Two rows have a similar name. Do they correspond to the same company? Since they have the same address, we will assume here that they do (even though this is not entirely certain).

We can therefore adapt the query we saw in the previous chapter, adding the  LIKE  operator, as follows: :

SELECT 
    i.id AS intermediary_id,
    i.name AS intermediary_name,
    e.jurisdiction,
    e.jurisdiction_description,
    count(*) as cnt
FROM 
    intermediary i,
    assoc_inter_entity a,
    entity e
WHERE 
    a.entity = e.id AND 
    a.inter = i.id AND 
    (i.id = 5000 OR i.id = 5001) 
GROUP BY 
    i.id, i.name, e.jurisdiction, e.jurisdiction_description
ORDER BY 
    cnt DESC
LIMIT 5;

Go Further:  Managing Duplicates in the Panama Papers

The designers of the Panama Papers database were aware that it contained many duplicates, but they were careful enough not to wrongly conflate two companies that had the same name. So they provided additional features to guide us.

They connected certain rows of the entity, intermediary, and officers tables using links such as "same name as" and "same address as".

You can find these links in the assoc_intermediaries, assoc_entities, and assoc_officers tables.

Summary

  • To search inexact character strings, use  LIKE  .

  • Specify patterns using the underscore sign  _  to replace any single unknown character, and the percentage sign  %  to replace any unknown sequence of characters (or no character).

Example of certificate of achievement
Example of certificate of achievement