In the last chapter, we saw that a sample can contain missing values, outliers, and duplicates. What to do?
When the sample contains missing attributes, there is unfortunately no miracle cure! However, there are several possible approaches you can take.
For a given variable (for instance, last chapter’s example of Date of Birth), if the proportion of missing attributes is low, you can just forget about them and do nothing: leave the sample intact. You will then be working with a data set that has “holes,” like a Swiss cheese. Depending on the statistical process you plan to apply, this solution might be acceptable, or it might not.
Forget a Variable
However, if for this same variable, the proportion of missing attributes is way too high, you’d better just forget about it—provided that the variable is not too important to your analysis. This is the same as not including a column in a table, as we saw in the last chapter.
If the variable with the missing data is crucial to your analysis, it’s better to create a sub-sample, removing the individuals for whom this variable is missing. For example, if you are analyzing your bank statements by looking at the amount of money you spend/earn, the “transaction amount” variable will be crucial. If the transaction amount is unknown for some of the rows of your statement, it’s better to create a sub-sample that removes all of the offending rows.
However, this method is risky. You might find yourself with a number of individuals (a number of rows) so small that your analysis is no longer meaningful. In addition, your sample might no longer be representative of the overall population. To find out why, go to the Take It Further section at the end of this chapter.
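Here is a minimal sketch of the sub-sample approach in plain Python. The statement rows, the field names, and the drop_missing helper are all made up for illustration. The last lines report how many rows survive, which is the figure to watch before trusting any analysis of the sub-sample.

```python
# Hypothetical bank statement rows; None marks an unknown transaction amount.
statement = [
    {"date": "2023-01-03", "label": "Groceries", "amount": -42.10},
    {"date": "2023-01-05", "label": "Salary", "amount": 1500.00},
    {"date": "2023-01-07", "label": "Unknown transfer", "amount": None},
    {"date": "2023-01-09", "label": "Rent", "amount": -600.00},
    {"date": "2023-01-12", "label": "Cash withdrawal", "amount": None},
]


def drop_missing(rows, variable):
    """Return the sub-sample of rows in which `variable` is known."""
    return [row for row in rows if row[variable] is not None]


sub_sample = drop_missing(statement, "amount")

# Check how much of the sample is left before going any further.
kept_share = len(sub_sample) / len(statement)
print(f"Kept {len(sub_sample)} of {len(statement)} rows ({kept_share:.0%})")
```

If the share of rows kept turns out to be small, that is a sign the sub-sample may be too thin or no longer representative.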
A more adventurous approach consists of filling in your holes with values you have guessed. This is pretty much the method for daredevils! Of course, these values will not correspond to actual values, but some methods manage to create values that are not too far off. Guessing a missing attribute is referred to as Imputation.
For example, we can replace the missing attributes of the height variable with the average height of the individuals in our sample. In our example, to correct Hannah’s height (which we assume is erroneous), we would replace it with the average height of the other individuals in the sample, which is 1.52 m. This is known as Mean Imputation.
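As a rough sketch, mean imputation can be written in a few lines of plain Python. The heights below are invented for the example (they are not the values from our sample), and impute_mean is a hypothetical helper name.

```python
from statistics import mean

# Hypothetical heights in meters; None marks a missing attribute.
heights = [1.60, None, 1.45, 1.55, None, 1.40]


def impute_mean(values):
    """Replace every missing value with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


imputed = impute_mean(heights)
print(imputed)
```

Note that the imputed values are all identical, so this method reduces the spread of the variable: keep that in mind if you later measure how dispersed the heights are.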
Guess Based on Other Variables
But we can do better! To replace a given variable, we can look at the variables around it. A number of methods apply this principle.
Imagine a new individual named Luke, born in 1991, whose height is unknown. Rather than assign him the mean of the entire sample (1.52 m), we can assign him the average height of people who are about his age. So let’s assign him the average height of people born between 1990 and 2000, or 1.49 m. Here, we looked at the value of the date_of_birth variable to come up with a value for the height variable.
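The same idea can be sketched in plain Python. The individuals A to D and their values are invented (chosen so that the 1990 to 2000 group averages 1.49 m, as in the text), and impute_height_by_birth_range is a hypothetical helper.

```python
from statistics import mean

# Hypothetical sample: birth year and height in meters (None = unknown).
people = [
    {"name": "A", "birth_year": 1992, "height": 1.50},
    {"name": "B", "birth_year": 1998, "height": 1.48},
    {"name": "C", "birth_year": 1975, "height": 1.60},
    {"name": "D", "birth_year": 1968, "height": 1.58},
    {"name": "Luke", "birth_year": 1991, "height": None},
]


def impute_height_by_birth_range(rows, start, end):
    """Fill missing heights of people born in [start, end] with the
    mean height of the people in that same range whose height is known."""
    def in_range(row):
        return start <= row["birth_year"] <= end

    known = [r["height"] for r in rows if in_range(r) and r["height"] is not None]
    group_mean = mean(known)  # raises StatisticsError if the group is empty
    return [
        {**r, "height": group_mean} if in_range(r) and r["height"] is None else r
        for r in rows
    ]


completed = impute_height_by_birth_range(people, 1990, 2000)
luke = next(r for r in completed if r["name"] == "Luke")
print(luke["height"])
```

Only the rows in the chosen birth-year range are touched; people outside it keep their original values, missing or not.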
Other methods also deduce the value of a variable by looking at other variables. These include hot-deck imputation and methods based on linear regression.
Hannah is 3.45 meters tall. You think that’s not so tall? You’re wrong. It’s very tall compared with the heights of other human beings.
But proceed cautiously, because an outlier isn’t necessarily false! Hannah might actually be 3.45 meters tall. Okay, that’s hard to believe, but it’s possible.
An outlier can be:
An aberration: a value that’s obviously false.
An atypical value: a value that “deviates from the norm,” but is not necessarily false.
Ideally, outliers should be checked to determine whether or not they’re erroneous. For example, a thermometer in Canada in April might indicate 40°C. This could be due to a defective temperature sensor, or it could be an actual value (although it is usually a little colder in Canada in the spring).
So what should we do with outliers? If we are sure that the value is erroneous (input error or flawed sensor, for example) and we can’t find the actual value, it has to be deleted. If we are not sure whether it’s erroneous, we can choose between:
Deleting the value. We then find ourselves with a missing attribute, to which we can impute a value, as we saw previously. But imputation isn’t mandatory.
Keeping the value.
How do you choose between these two options? It all depends on how you will be processing your data afterward. Some methods are considered “robust,” meaning they are not destabilized by outliers. For example, we will see later that the mean is very sensitive to outliers, while the median is not. So if you want to calculate a mean, create a sub-sample that excludes the outliers. But if you want to calculate a median, you can work with the original sample.
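To get a feel for the difference, here is a small sketch using Python's statistics module. Apart from Hannah's 3.45 m, the heights are invented for the example.

```python
from statistics import mean, median

# Hypothetical heights in meters; the last one is Hannah's 3.45 m.
heights = [1.45, 1.50, 1.55, 1.60, 3.45]
without_outlier = heights[:-1]  # sub-sample that leaves Hannah out

mean_shift = mean(heights) - mean(without_outlier)
median_shift = median(heights) - median(without_outlier)

print(f"mean:   {mean(heights):.3f} with outlier, {mean(without_outlier):.3f} without")
print(f"median: {median(heights):.3f} with outlier, {median(without_outlier):.3f} without")
```

With these made-up values, a single outlier moves the mean by several tens of centimeters, while the median moves by only a few.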
What about Duplicates?
In our example, Samuel appears twice. That’s a problem, because this duplicate compromises the analysis, in particular by distorting the sample’s average height.
Duplicates have to be removed. However, there is no precise rule for detecting them: you alone can find them, based on the structure of your data and your knowledge of how the data were collected. But sometimes, it will be impossible.
A little example: if your sample contains an “identifier” variable, then it’s easy to detect duplicates. They are the rows that share the same identifier. In our example, we can consider the email address to be an individual’s identifier, so the two rows containing the email address email@example.com constitute a duplicate.
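Detecting duplicates through an identifier could look something like this in plain Python. The rows, the addresses, and the find_duplicates helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical rows; the names and addresses are made up for illustration.
sample = [
    {"name": "Samuel", "email": "samuel@example.com"},
    {"name": "Hannah", "email": "hannah@example.com"},
    {"name": "Samuel", "email": "samuel@example.com"},
    {"name": "Luke", "email": "luke@example.com"},
]


def find_duplicates(rows, identifier):
    """Map each identifier value seen more than once to its row positions."""
    positions = defaultdict(list)
    for index, row in enumerate(rows):
        positions[row[identifier]].append(index)
    return {value: found for value, found in positions.items() if len(found) > 1}


duplicates = find_duplicates(sample, "email")
print(duplicates)
```

The function only reports where the duplicates are. Deciding what to do with them is a separate step.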
Another example: Say you are analyzing temperature records from a small town. The town has two weather stations. Station 1 operated for many years until January 15, 2019, and then was taken offline, due to age. Because this was expected, Station 2 had already been installed (in the same place) to take over for it. Station 2 began operating on January 2, 2019. Your sample is therefore made up of records from both stations. However, records made between January 2 and January 15 are duplicates, because both stations were operating at the same time. For each date in this interim period, therefore, you must delete one of the two records.
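Here is one possible sketch of that clean-up in plain Python. The temperature readings are invented, and keeping Station 2 during the overlap is an arbitrary choice made for the example: the text only requires that one record per date remains.

```python
from datetime import date

# Hypothetical records: (date, station, temperature in °C).
records = [
    (date(2019, 1, 1), "station_1", -3.0),
    (date(2019, 1, 2), "station_1", -2.5),
    (date(2019, 1, 2), "station_2", -2.4),
    (date(2019, 1, 15), "station_1", 0.5),
    (date(2019, 1, 15), "station_2", 0.6),
    (date(2019, 1, 16), "station_2", 1.0),
]


def one_record_per_date(rows, preferred="station_2"):
    """Keep a single record per date, favoring the preferred station
    whenever both stations reported on the same day."""
    by_date = {}
    for day, station, temperature in rows:
        if day not in by_date or station == preferred:
            by_date[day] = (day, station, temperature)
    return [by_date[day] for day in sorted(by_date)]


deduplicated = one_record_per_date(records)
for record in deduplicated:
    print(record)
```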
Yes, but which of our two rows containing Samuel’s email address do we delete? Do we just pick one at random?
Cases like this call for greater attention. It’s better, in fact, to group the duplicates in one row. Of the two rows in our sample, the first informs us that Samuel was born on 20/09/2001, and the second informs us that Samuel lives in Benin (information that is missing from the first row). The real problem has to do with the height: the first row tells us that Samuel is 1.67 meters tall, while the second tells us that he’s only 1.45 meters tall. That’s a contradiction. If there is no other means of verifying Samuel’s height, we can, for example, choose to take the mean of these two values.
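Grouping Samuel's two rows could be sketched as follows. The field names and the merge_duplicates helper are hypothetical, and we assume the second row has no date of birth. Contradictory numeric values are averaged, as suggested above. Any other contradiction is left empty so that it can be checked by hand.

```python
from statistics import mean

# Samuel's two rows; None marks a missing attribute.
# Assumption: the second row carries no date of birth.
row_1 = {"name": "Samuel", "date_of_birth": "20/09/2001", "country": None, "height": 1.67}
row_2 = {"name": "Samuel", "date_of_birth": None, "country": "Benin", "height": 1.45}


def merge_duplicates(rows):
    """Group duplicate rows into one, field by field."""
    merged = {}
    for field in rows[0]:
        values = [r[field] for r in rows if r[field] is not None]
        distinct = set(values)
        if len(distinct) == 1:
            # Only one known value (or all rows agree): keep it.
            merged[field] = values[0]
        elif distinct and all(isinstance(v, (int, float)) for v in distinct):
            # Contradictory numbers: take their mean.
            merged[field] = mean(values)
        else:
            # Nothing known, or a non-numeric contradiction: leave it empty.
            merged[field] = None
    return merged


samuel = merge_duplicates([row_1, row_2])
print(samuel)
```

With these two rows, the merged height is the mean of 1.67 m and 1.45 m, that is, 1.56 m.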
Take It Further: The Consequences of Removing Individuals
Imagine a sample of people represented in the same form as in the last chapter, with a Date of Birth variable.
You decide to delete all of the dates of birth that don’t conform to the format day/month/year, which creates missing attributes for the date of birth variable. Then you decide to delete all of the rows (all of the individuals) that have a missing date of birth. You will probably end up removing all of the people who live in the United States, because their dates are expressed in a different format from those of French-speaking countries. If you then perform an analysis on the heights, your sample will no longer be representative, because people from the United States may well have a different average height from people in other countries.
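The effect can be reproduced with a small sketch. The rows below are invented, and we assume the United States dates are written month/day/year. Counting individuals per country before and after the deletion makes the bias visible.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample; assumption: US dates are written month/day/year.
people = [
    {"country": "France", "date_of_birth": "23/06/1990"},
    {"country": "Benin", "date_of_birth": "20/09/2001"},
    {"country": "United States", "date_of_birth": "06/23/1990"},
    {"country": "United States", "date_of_birth": "09/20/2001"},
]


def is_day_month_year(text):
    """Return True if `text` parses as a day/month/year date."""
    try:
        datetime.strptime(text, "%d/%m/%Y")
        return True
    except ValueError:
        return False


kept = [p for p in people if is_day_month_year(p["date_of_birth"])]

before = Counter(p["country"] for p in people)
after = Counter(p["country"] for p in kept)
print("before:", dict(before))
print("after: ", dict(after))
```

In this made-up sample, every United States row disappears. Be aware that a month/day/year date such as 01/02/1990 would pass the check unnoticed and be read as 1 February, which is a different problem again.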