This chapter concerns best practices. There’s nothing technical here, just someone (me) warning you against certain pitfalls!
Specifically, I will be explaining the importance of knowing your tables’ candidate keys.
Why are the Keys so Important?
It is very important to know at least one candidate key per table.
Why? Because these keys determine the nature of the objects the table represents.
Suppose someone sends you a CSV file that contains the following table:
You may say to yourself, "This table represents apples: 1 row = 1 apple. The primary key is, therefore [id_apple]."
Based on this assumption, you calculate the total weight of all of the apples. You find
Except... as you explore the table a little further, you come upon this:
You discover that apple
39 has two owners: Douglas and Dorian.
It never occurred to you that an apple could belong to two people at once! And that changes a lot of things:
The primary key is no longer
[id_apple, owner], because id_apple contains duplicates (here,
39is duplicated in id_apple).
The table you have before you no longer represents apples, but apple-owner associations (see the chapter Using Association Tables). Your assumption that 1 row = 1 apple is no longer valid!
The weight calculation is false because you included the weight of apple 39 twice (and this might be the case for more than just apple 39!). This calculation error costs you your credibility in the eyes of the person who asked you to calculate the total weight. In short, it’s a disaster!
Why are keys so important in joins?
Having a solid knowledge of your keys is also very important with respect to Joins.
Generally, we join:
a foreign key of a table A...
.... to the primary key (or a candidate key) of table B.
When we do so, the resulting table will have:
as many as, or fewer rows than A, if we are performing an inner join
as many rows as A, if we are performing a left outer join (with A on the left).
At the end of your Join, don’t forget to count the number of rows.
As a general rule, in your join condition:
A.key_foreign = B.key_candidate
you need to make sure that at least one of the terms on either side of the equals sign is a candidate or primary key.
If it isn’t, you can still use it, but you must really make sure there is no error in your reasoning. Make sure it’s what you really need. Above all, thoroughly check the final table for consistency and number of rows.
Go Further: Test whether an Attribute Set is a Candidate Key
How can you verify whether or not a set of attributes G is a candidate key?
To find out, create a projection over G (basically, exclude the columns that are not G), then look to see if there are any duplicates.
If there are duplicates, your G is not a key.
If there are no duplicates, then you have two options:
1. If you are sure that no new row will be added, you can consider G a key.
2. If other rows might be added later, you need to contact the designer of the table to see if G is a key, or to see what a row represents.
Yes, but how can I find out whether a table contains duplicates?
Check a Table for Duplicates
If your table contains many rows, it may be difficult to detect duplicates with the naked eye! So here is a recipe you can use in your preferred programming language:
Count the number of rows in your table
Discard the duplicates
Re-count the number of rows in your table.
If the number of rows in step 1 is different from the number of rows in step 3, your table contains duplicates!
Many data analysis languages make it easy for you to eliminate duplicates from a table:
For each new table (that you generate yourself, or that you come across), make sure you know at least one candidate key! This will tell you what a row represents.
There is a method for testing whether a set of attributes is a key or not.
After a join, always count the number of rows obtained.
When you perform a join on something other than a foreign key, you really need to pay attention to what you’re doing!