• 8 hours
  • Medium

Free online content available in this course.

course.header.alt.is_video

course.header.alt.is_certifying

Got it!

Last updated on 5/5/22

Use Normalization to Improve Your Model

There are always multiple ways of modeling data, and some are better than others. One of the first criteria for assessing quality is redundancy in the model. A set of criteria was created, database normalization, to help assess the quality of two different data models.

Up to now, we have talked a lot about redundancy, but without defining the rules. Database normalization will fill this gap. 

Learn the Notion of Dependency

Let’s use the CSV file to look at redundancy:

Extract from CSV file
Extract from CSV file

You can see redundancy in the type of production. For example, data indicates that the feature film Nine Months, produced by 1492 Pictures, appears several times. Why? Because the filming type is only dependent on the production. However, many rows in the file reference the same production, which means the data is duplicated.

Saying that “the filming type is only dependent on the production” means that if you see Nine Months by 1492 Pictures in the file, the filming type must be  FeatureFilm  .

What do you need to identify a production in your file? We already mentioned that you identify a production using its title and production company, i.e.,  (title, production_company)  .

So,  filming_type  is functionally dependent on the set of attributes   (title, production_company  ), known as functional dependency, indicated using an arrow:

(title, production_company)  ->  filming_type

But hold on. The attribute set (title, production_company) isn’t unique in your file as several rows relate to the same production. If the same production appears several times (due to the functional dependency), several rows will indicate the production type for a given production. And that’s where you have redundancy.

So, if you want to avoid redundancy, you need  (title, production_company)  to be unique. Therefore, this set of attributes will need to be the primary key (or at least a candidate key). Here’s what you can conclude from this:

An added detail: there is still a risk of redundancy if an attribute is dependent on part of the primary key. Imagine a table with a  (title, production_company)  primary key, and a  production_company_address  attribute to hold the production company’s address.

production_company_address  is uniquely dependent on  production_company  . Since  production_company  isn't unique (because it is only part of the primary key), there is a risk of redundancy: 

Title [PK]

Production company [PK]

Filming Type

Nine Months

1492 Pictures

FeatureFilm

Casualties of War

Columbia Picture Corporation

FeatureFilm

Silicon Valley

Brown Hill Productions, LLC

TVSeries

This poses a risk of redundancy.

You can add the following to the previous sentence:

Understand How to Define Dependency

But how can we be sure that an attribute A is dependent on a set of attributes S?

Imagine that you know the data you are modeling very well. Then, ask yourself this question: if you only have S, can you determine A with certainty, with no ambiguity, and without further information?

For example:

Can you determine the filming type (TV film, TV series, etc.) for a production if you only know the production company? The answer is no because one production company can produce several different types of productions. It’s ambiguous, and you need more information – i.e., the production’s title – to resolve it. With the title and the production company, you can determine the filming type if you know the data back to front. So, you can say that:

  • filming_type  is not uniquely dependent on  production_company  .

  • filming_type  is dependent on  (title, production_company)  .

Ah, but if you know the title, production company, and location of a given filming event, you can also determine the  filming_type  . So can you say that  filming_type  is dependent on   (title, production_company, scene_location)  ?

Yes, but it’s not particularly interesting because   (title, production_company, scene_location)    isn't minimal, meaning that you can take away one of the attributes (  scene_location  ) without breaking the dependency. You can say that:

  • filming_type  is dependent on part of   (title, production_company, scene_location)  .

Work Out the First Three Normal Forms

The process you just saw is called the norm forms and has three rules you should follow. There are more than three, but generally, a model that conforms with the first three forms avoids most data redundancy.

1NF

The first normal form (1NF) is a prerequisite for the following forms. Regarding primary keys, you must ensure that:

  • Each table has a primary key.

  • Each attribute only holds one piece of information at a time (otherwise, it can get messy to check the redundancy and uniqueness of an attribute). 

In other words, each attribute must be atomic, i.e., it can’t be multivalued or composite (to refresh your memory on these two terms, look at chapter 1, part 2: “Lay the Foundations for Your Class Diagram”).

composite attribute

 

atomic attributes

  • address

=>

  • property_number

  • suffix

  • street_name

  • town_city

  • postcode

  • country

Let’s look at an example.

In your file, the  director  is multivalued because one cell can hold several directors’ names.

You considered this in your relational model because you transformed this multivalued attribute into a new table,  director  , containing a name attribute, which is atomic (and that's what you want). This new table is linked to the  production  table using a many-to-many association:

Many-to-many association between production and director
Many-to-many association between production and director

2NF

The second normal form will flow naturally from the thought process you followed above:

A table is in second normal form (2NF) if:

  • It’s in 1NF.

  • Every attribute (that doesn't belong to a candidate key) is not dependent on only one part of a candidate key.

3NF

A relation is in third normal form (3NF) if:

  • It’s in 2NF.

  • Every attribute (that doesn't belong to a candidate key) is not dependent on another attribute that doesn't belong to a candidate key.

Learn to Comply With the Three Normal Forms

To comply with 1NF, you need to:

  • Choose a primary key for each table.

  • Transform multivalued attributes into a new table linked to the original one using a one-to-many or many-to-many association.

  • Break the composite attributes into atomic attributes (e.g., deconstruct Ms. Jane Smith into three attributes: title, first name, last name).

You need to study the dependencies between attributes for 2NF and 3NF.

There is a rule to separate the attributes of one table to reduce redundancy without losing any information to help with this. Here it is:

Also, if another attribute (B) is uniquely dependent on S, you must move the attribute to the new table (T2).

For example, in your CSV file, you’ve seen that  filming_type  is dependent on  (title, production_company)  . This attribute set is also minimal. You can apply the above rule to say:

  • T1 = your CSV file.

  • A =    filming_type  .

  • S =    (title, production_company)  .

Following this rule:

  1. Create T2 with a primary key of  (title, production_company  ), which holds   filming_type  as an attribute. 

  2. Delete the  filming_type  column from the big CSV file.

  3. Indicate that  (title, production_company)  is a foreign key referencing T2.

And surprise! You notice that T2 exactly matches the  production  table that you previously defined!

title [PK]

production_company [PK]

filming_type

Nine Months

1492 Pictures

FeatureFilm

Casualties of War

Columbia Picture Corporation

FeatureFilm

Silicon Valley

Brown Hill Productions, LLC

TVSeries

This is a good sign. It means you were on the right track when you defined your relational model without realizing it!

Would you have arrived at the same relational model if you’d followed this process to its conclusion (i.e., by breaking the large CSV file down bit by bit into separate tables)?

You’ll be able to think about your answer in just a few seconds. And actually, you’re going to apply this process in the next chapter when you separate your CSV file into different tables.

Separate Your File

Consider your CSV file as one table and then split it into several by analyzing the dependencies between the attributes. Don’t forget to apply the above rules.

But before starting, you need to determine the primary key of your big file.

Easy! The  scene_id  identifies each row, so this is the primary key.

That is essentially correct, but it’s not going to help much! If you use this logic, you will fall into the trap mentioned at the end of the “Determine Your Primary Keys” chapter. As a reminder, this trap involves considering an artificial key (e.g.,  scene_id  ) as a simple solution without really thinking about the uniqueness constraint of the other attributes.

This primary key will tell you about each row in the table.

Instinctively, you could have said, “The  scene_id  column identifies each row.” Therefore, each row represents a location. But this isn’t true as sometimes multiple rows correspond to the same location. For example, the TV series  Looking  and the feature film  Diary of a Teenage Girl  both filmed scenes at  770 Haight St  . There are fewer locations than rows in your file.

The  scene_id  is not a location identifier. Instead, it identifies the location of a production’s filming session for a given period. In other words, it’s the identifier of an association between:

  • A location (identified using  location  and  zip code  ).

  • A filming session (identified using  start_date,end_date  ) for a production (identified using  title  and  production_company  ).

So, you can determine that one of the possible primary keys is  (location, zip_code, start_date, end_date, title, production_company)   .

Based on this information, you can move any attribute that isn't dependent on this primary key or that only depends on part of it into a new table.

Now it's Your Turn!

You have everything you need to create a relational model based on the columns of your file.

By way of an answer, here are the dependencies between the attributes:

  1. (   title  ,    production_company  ) ->    filming_type  . 

  2. (    start_date  ) ->   filming_year  . This attribute is not useful because it's derived.

  3. (   title  ,   production_company  ,  start_date, end_date  ) →  director  .

With points 1 and 2, everything aligns with what you worked out before this chapter. But point 3 is different. You can see that a director is not solely dependent on the production in question but also the filming session.

Directors can be replaced during filming, which is often true for series, where a different person could direct each episode.

For example, Sara Gran and Carl Franklin (among others) directed the series Chance.

Consider the relational model from before the previous chapter. There was no way of knowing who was directing for a given period. In other words, even if you knew the directors of Chance, you had no way of knowing which episodes were directed by whom.

To resolve this, apply the above rule and create a table with a primary key of (    title  ,  production_company  ,  start_date  ,  end_date  ) containing one attribute, director. This table gives you the filming periods for a given production (but with no indication of the location). Call it  shooting_period  .

In addition, the director column is multivalued because some rows contain a pair of directors. As mentioned earlier in this chapter, you can break it down into a new table called  director  , which is linked to  shooting_period  by a many-to-many association, as follows:

New table shooting_period linked to director
New table shooting_period linked to director

When you've created all of your separate tables, you’ll have:

  • location_id  .

  • A foreign key to the  location  table, composed of two columns.

  • A foreign key to the  filming_period  table, consisting of four columns.

You can delete  location_id  as it no longer has a purpose. This table forms an association table between location and  shooting_period  , which is handy because it indicates a many-to-many relationship. However:

  • One filming session can take place during the same period in two different locations (with two different teams or with one team that moves during the day).

  • One location can also host separate filming events during the same period.

So, let’s call this table  assc_period_loc  .

Yay! You now have your final relational model, which is in 3NF!

The final relational model!
The final relational model!
Updated UML diagram
Updated UML diagram

Let’s Recap!

  • Some attributes can be dependent on others.

  • To (mostly) avoid redundancy in a table, each attribute (that doesn’t belong to the primary key) must only be dependent on the table’s primary key. Each attribute must also be dependent on the whole primary key.

  • A table is in 1NF if there is a primary key and its attributes are atomic.

  • A table is in 2NF if it is in 1NF, and every attribute (that doesn't belong to a candidate key) is not dependent on only part of a candidate key.

  • A relation is in 3NF if it is in 2NF, and every attribute (that doesn't belong to a candidate key) is not dependent on another attribute that doesn't belong to a candidate key.

  • There is a rule to separate the attributes of one table to reduce redundancy without losing any information.

  • After normalizing a relational model, you must remember to update your UML in line with it.

There you have it! You have all of the skills required to model a relational database. But I imagine you probably feel like you haven't quite finished because you haven't even touched your data yet! The final chapter will satisfy your hunger! You will finally get to manipulate your Excel file to fit your model, and, in the end, you can even use SQL to query your database!

Example of certificate of achievement
Example of certificate of achievement