In this course, you will be learning how to cleanse and describe your data.
When you begin analyzing your data, you need to cleanse them of all errors. If not, the code you write to make your lovely graphs (and other representations) will crash. Worse, if your sample contains errors, your analyses may well contain errors, too.
Every data analyst or data scientist will tell you that, unfortunately, we spend more time cleansing our data than analyzing them. Cleansing your dataset is a crucial component in data analysis, even though it may not be the most exciting part.
Wait, why do we have errors in the first place?
It’s all comes down to a question of data source.
If you have data input by humans, it is highly likely to have errors. Imagine you are volunteering as a poll worker for a local election and are asked to gather elections results on paper and then manually type them into a spreadsheet program. At some point, some way, an error is bound to occur!
Consider another situation where data is provided by sensors (for example, a geolocation system on your cell phone, a thermometer, or speed sensors on your car). An error can occur if the sensor has deteriorated over time and is no longer or no longer works at all (is no longer sending data).
Identify the Different Error Types
Let’s take the example of a sample of people described by a number of different variables:
Date of Birth
12 Sept. 1984
Well... You can see that this sample is not super clean, right?
Can you point out a few inconsistencies? Write them down a few and check your answers below!
First, there are empty cells for the "country" and "date of birth variables". We call these missing attributes.
If you look at the "Country" column, you see a cell that contains 24. “24” is definitely not a country! This is known as a lexical error.
Next, you may notice in the "Height" column that there is an entry with a different unit of measure. Indeed, Rodney's height is recorded in feet and inches while the rest are recorded in meters. This is an irregularity error because the unit of measures are not uniform.
Mark has two email addresses. It’s is not necessarily a problem, but if you forget about this and code an analysis program based on the assumption that each person has only one email address, your program will probably crash! This is called a formatting error.
Look at the "date of birth" variable. There is also a formatting error here as Rodney’s date of birth is not recorded in the same format as the others.
Samuel appears on two different rows. But, how can we be sure this is the same Samuel? By his email address, of course! This is called a duplication error. But look closer, Samuel’s two rows each give a different value for the "height variable": 1.67m and 1.45m. This is called a contradiction error.
Hannah is apparently 9'1". This height diverges greatly from the normal heights of human beings. This value is, therefore, referred to as an outlier.
Deal With These Errors
I’ll tell you right away that, when it comes to cleansing data sets, there is no set rule. Everything you do depends on how you plan to use your data. No two data analysts will cleanse the same data set the same way—not if their objectives are different!
So there’s no set rule, but I can give you a few pointers:
1. Missing attributes will be addressed in the following chapter.
2. For the invalid country, it’s possible to supply a list of authorized countries in advance, then eliminate all of the values that are not found on this list (hint: 24 will not be found). Such a list is often referred to as a dictionary.
For irregularity errors, it’s more complicated! You can, for example, set a fixed format (here: a decimal number followed by the letter “m” for “meter”) and eliminate values that don’t adhere to it. But we can do better, by first detecting what unit the value is expressed in (meters or centimeters) then converting everything to the same unit.
For the formatting error of the duplicate email address, it all depends on what you want to do. If you won’t be looking at emails in your future analysis, there’s no need to correct this error. If, on the other hand, you want to know the proportion of people whose address ends in, for example @example.com, or @supermail.eu, etc., then you can choose between:
Taking the first email address and forgetting the second one.
Keeping all email addresses.
Let’s move on to the Date of Birth variable. Aaaaaaah, dates! Believe me, they will always give you headaches! There are many different formats; each country has its own custom when it comes to writing dates (France and North America, for example, do not use the same format). Add to this the problem of time zones! In our case, the simplest solution would be to eliminate dates that are not in the desired format month/day/year.
Duplicates will be discussed in the next chapter.
Outliers will also be discussed in the next chapter!
However, if there are many errors of the same type, you might as well create a program to correct them. For example, if 60% of the heights are expressed in meters, 35% in centimeters, and 5% in other units, then 35% of the errors are of the same type (35% of the values are expressed in centimeters instead of meters). In this case, you might as well write a few lines of code to convert the centimeters to meters. If you are motivated and the data set is worth it to you, go ahead and correct the remaining 5%, although that might take you longer!
Take It Further: External Resources
In this course, we will cleanse our data using Python.
But you should know that there is a very good tool for cleansing data, accessible to non-programmers, called OpenRefine.