Data is Messy!
Estimates suggest that data scientists spend around 60-80% of their time cleaning and preparing data. This is not an overestimate! Your project data is rarely ready to be loaded straight into your machine learning models.
There may be numerous challenges, such as:
Missing data
Poorly formatted data
Misspellings
Duplicates
Errors
Outliers
Dealing with these sorts of challenges is called data cleansing. We will review these concepts in this chapter, but for more information, feel free to check out a previous course called Perform an Initial Data Analysis. In the next chapter, we will take a look at feature engineering, where you augment data with new features.
Data cleansing deals with the problem of dirty data, and it's nearly always dirty! You need to spend a good amount of time cleaning it up because it makes machine learning activities more difficult or impossible, and can lead to poor or misleading results! :(
Cleaning a Small Dataset
As a refresher, let's walk through the cleanup of a very small dataset.
id | location | date_of_sale | number_of_bedrooms | price | type |
0 | Clapham | 12/04/1999 | 1 | £729,000 | apartment |
1 | Ashford | 05/08/2017 | unknown | £699,000 | semi-detached |
2 | Stratford-on-Avon | 29/03/2012 | 3 | | detached |
3 | Canterbury | 01/07/2009 | 2 | £529,000 | teraced |
4 | Camden | 16/12/2001 | 1 | £616,000 | apartment |
5 | Rugby | 01/03/2003 | - | £247,000 | detached |
6 | Hampstead | 05/03/2016 | 2 | £0 | terraced |
7 | Clapham | 05/07/2001 | 363 | £543,000 | apartment |
8 | Stratford-on-Avon | 10th May 2010 | 3 | £420,000 | detached |
9 | Camden | 16/12/2001 | 1 | £616,000 | apartment |
How many data issues can you spot?
Formatting Errors
The date_of_sale in this row is in a different format than the other dates:
id | location | date_of_sale | number_of_bedrooms | price | type |
8 | Stratford-on-Avon | 10th May 2010 | 3 | £420,000 | detached |
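One way to spot rows that deviate from the expected format is to parse the column strictly and see what fails. Here is a minimal sketch, assuming dd/mm/yyyy is the expected format:
import pandas as pd

# Dates that don't match dd/mm/yyyy are coerced to NaT, flagging the odd rows out
parsed = pd.to_datetime(df["date_of_sale"], format="%d/%m/%Y", errors="coerce")
print(df[parsed.isna()])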
Strings in Numeric Fields
These rows have strings in the number_of_bedrooms feature:
id | location | date_of_sale | number_of_bedrooms | price | type |
1 | Ashford | 05/08/2017 | unknown | £699,000 | semi-detached |
5 | Rugby | 01/03/2003 | - | £247,000 | detached |
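A quick way to surface these rows (a sketch, not part of the original lesson) is to attempt a numeric conversion and inspect what fails:
import pandas as pd

# Values that can't be parsed as numbers become NaN, flagging the bad rows
bedrooms = pd.to_numeric(df["number_of_bedrooms"], errors="coerce")
print(df[bedrooms.isna()])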
Outliers
The following row has an extreme (and unbelievable) value for number_of_bedrooms:
id | location | date_of_sale | number_of_bedrooms | price | type |
7 | Clapham | 05/07/2001 | 363 | £543,000 | apartment |
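Summary statistics are often the quickest way to expose values like this; a minimal sketch, assuming number_of_bedrooms has already been converted to a numeric type:
# An impossible maximum (363 bedrooms!) stands out immediately
print(df["number_of_bedrooms"].describe())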
Missing Values
The following row has a missing price:
id | location | date_of_sale | number_of_bedrooms | price | type |
2 | Stratford-on-Avon | 29/03/2012 | 3 | | detached |
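Assuming pandas loaded the empty cell as NaN, you can pull out such rows directly:
# Show every row with a missing price
print(df[df["price"].isna()])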
Misspellings
The following row has a misspelling in the type column:
id | location | date_of_sale | number_of_bedrooms | price | type |
3 | Canterbury | 01/07/2009 | 2 | £529,000 | teraced |
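Counting the distinct values in a categorical column makes one-off misspellings like this easy to spot:
# 'teraced' will show up as a rare category alongside 'terraced'
print(df["type"].value_counts())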
Duplicates
The following rows look like they could be duplicates (although it's possible that two identical apartments were sold together on the same day!):
id | location | date_of_sale | number_of_bedrooms | price | type |
4 | Camden | 16/12/2001 | 1 | £616,000 | apartment |
9 | Camden | 16/12/2001 | 1 | £616,000 | apartment |
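Assuming the id values form the DataFrame's index (which the drop and loc calls later in this chapter also assume), you can flag every copy of a duplicated row:
# keep=False marks every copy of a duplicate, not just the second and later ones
print(df[df.duplicated(keep=False)])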
Zero for Nulls
The following row has a zero for price, which is probably a case of a zero being entered for an unknown price:
id | location | date_of_sale | number_of_bedrooms | price | type |
6 | Hampstead | 05/03/2016 | 2 | £0 | terraced |
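Once price has been converted to a numeric type (see the cleaning steps below), here is a minimal sketch for treating these zeros as unknown values:
import numpy as np

# A zero price almost certainly means 'unknown', so mark it as missing
df["price"] = df["price"].replace(0, np.nan)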
Data Cleansing Strategies
Convert Data Types
Ensure that features are treated with the correct data type:
Dates
Integers
Floats
Objects
In our example, you would process the data so that the following data types are applied:
location - object
date_of_sale - datetime
number_of_bedrooms - int
price - int
type - object
Here is the Python code to achieve this (note that the numeric conversions will only succeed once the replacement steps described below have been applied, since strings like "unknown" and "£729,000" can't be parsed directly):
import pandas as pd

# dayfirst=True parses the UK-style dates; format="mixed" (pandas 2.0+) also accepts "10th May 2010"
df["date_of_sale"] = pd.to_datetime(df["date_of_sale"], dayfirst=True, format="mixed")
df["number_of_bedrooms"] = pd.to_numeric(df["number_of_bedrooms"])
df["price"] = pd.to_numeric(df["price"])
Replace Values
Replace incorrect, invalid, or troublesome values.
Replace these strings in number_of_bedrooms:
id | location | date_of_sale | number_of_bedrooms | price | type |
1 | Ashford | 05/08/2017 | unknown | £699,000 | semi-detached |
5 | Rugby | 01/03/2003 | - | £247,000 | detached |
with nulls:
id | location | date_of_sale | number_of_bedrooms | price | type |
1 | Ashford | 05/08/2017 | NaN | £699,000 | semi-detached |
5 | Rugby | 01/03/2003 | NaN | £247,000 | detached |
Here is the Python code to achieve this:
import numpy as np

# Find the non-numeric strings (e.g. "unknown", "-") and replace them with NaN
non_nums = df[~df["number_of_bedrooms"].str.isnumeric()]["number_of_bedrooms"].unique()
df["number_of_bedrooms"] = df["number_of_bedrooms"].replace(non_nums, np.nan)
Replace the misspelling:
id | location | date_of_sale | number_of_bedrooms | price | type |
3 | Canterbury | 01/07/2009 | 2 | £529,000 | teraced |
with the correct spelling:
id | location | date_of_sale | number_of_bedrooms | price | type |
3 | Canterbury | 01/07/2009 | 2 | £529,000 | terraced |
Here is the Python code to achieve this:
df["type"] = df["type"].replace(['teraced'], 'terraced')
Remove all the £ and comma symbols in the price column:
id | price | cleaned_price |
0 | £729,000 | 729000 |
1 | £699,000 | 699000 |
3 | £529,000 | 529000 |
4 | £616,000 | 616000 |
5 | £247,000 | 247000 |
7 | £543,000 | 543000 |
8 | £420,000 | 420000 |
9 | £616,000 | 616000 |
Here is the Python code to achieve this:
df["price"] = df["price"].apply(lambda x: x.replace('£', '') if type(x) is str else x)
df["price"] = df["price"].apply(lambda x: x.replace(',', '') if type(x) is str else x)
Remove Outliers
If you believe an outlier to be erroneous or troublesome, you can remove it.
Consider removing this row:
id | location | date_of_sale | number_of_bedrooms | price | type |
7 | Clapham | 05/07/2001 | 363 | 543000 | apartment |
Here's how to do this in Python:
# Drop the row whose index label is 7
df = df.drop(7)
Replace Outlier Values
Instead of removing outliers, consider replacing them with the correct values (if you can determine them) or imputing estimated values (see handling missing values later).
You can replace the number_of_bedrooms with the average number of bedrooms for a £543,000 apartment in Clapham. This could be calculated from the given data or selected from an external dataset.
id | location | date_of_sale | number_of_bedrooms | price | type |
7 | Clapham | 05/07/2001 | 363 | £543,000 | apartment |
After replacement, the above becomes:
id | location | date_of_sale | number_of_bedrooms | price | type |
7 | Clapham | 05/07/2001 | 1 | £543,000 | apartment |
Here's one way to do this in Python:
import numpy as np

# Mark the outlier as missing, then fill it with the mean for its location
df.loc[7, 'number_of_bedrooms'] = np.nan
df['number_of_bedrooms'] = df.groupby('location')['number_of_bedrooms'].transform(lambda x: x.fillna(x.mean()))
Remove Duplicates
Remove any true duplicates.
Remove one of these rows:
id | location | date_of_sale | number_of_bedrooms | price | type |
4 | Camden | 16/12/2001 | 1 | £616,000 | apartment |
9 | Camden | 16/12/2001 | 1 | £616,000 | apartment |
Here's the Python code that removes all duplicate rows:
# Drop exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates()
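Note that if id were a regular column rather than the index, rows 4 and 9 would differ in their id values and drop_duplicates() would keep both. In that case, a hedged variation is to compare on every column except id:
# Ignore the id column when deciding what counts as a duplicate
df = df.drop_duplicates(subset=df.columns.difference(["id"]))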
Handle Missing Values
Having applied the above techniques, we will have a dataset containing nulls. Most machine learning algorithms can't process data containing nulls, so we need to handle them.
After applying other cleaning techniques, we may have the following:
id | location | date_of_sale | number_of_bedrooms | price | type |
0 | Clapham | 12/04/1999 | 1 | 729000 | apartment |
1 | Ashford | 05/08/2017 | NaN | 699000 | semi-detached |
2 | Stratford-on-Avon | 29/03/2012 | 3 | NaN | detached |
3 | Canterbury | 01/07/2009 | 2 | 529000 | terraced |
4 | Camden | 16/12/2001 | 1 | 616000 | apartment |
5 | Rugby | 01/03/2003 | NaN | 247000 | detached |
6 | Hampstead | 05/03/2016 | 2 | NaN | terraced |
7 | Clapham | 05/07/2001 | 1 | 543000 | apartment |
8 | Stratford-on-Avon | 10/05/2010 | 3 | 420000 | detached |
There are nulls in the number_of_bedrooms and price columns.
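You can confirm which columns contain nulls, and how many, with a quick count:
# Count the nulls in each column
print(df.isna().sum())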
We could apply one of the following strategies:
Remove Columns Containing Nulls
If a column containing nulls is not important to your analysis, you can remove it entirely.
This is what it would look like:
location | date_of_sale | type |
Clapham | 12/04/1999 | apartment |
Ashford | 05/08/2017 | semi-detached |
Stratford-on-Avon | 29/03/2012 | detached |
Canterbury | 01/07/2009 | terraced |
Camden | 16/12/2001 | apartment |
Rugby | 01/03/2003 | detached |
Hampstead | 05/03/2016 | terraced |
Clapham | 05/07/2001 | apartment |
Stratford-on-Avon | 10/05/2010 | detached |
In this case, number_of_bedrooms and price are of such importance that we would not want to remove them.
Here is the Python code to achieve this:
# Drop every column that contains at least one null
df = df.dropna(axis=1)
Remove Rows Containing Nulls
Losing a few rows that contain nulls might be acceptable, but if many rows contain nulls, this strategy could cost you a lot of data!
Applying this strategy to the data would look like:
location | date_of_sale | number_of_bedrooms | price | type |
Clapham | 12/04/1999 | 1 | 729000 | apartment |
Canterbury | 01/07/2009 | 2 | 529000 | terraced |
Camden | 16/12/2001 | 1 | 616000 | apartment |
Clapham | 05/07/2001 | 1 | 543000 | apartment |
Stratford-on-Avon | 10/05/2010 | 3 | 420000 | detached |
Here is the Python code to achieve this:
# Drop every row that contains at least one null
df = df.dropna()
Impute Nulls With the Mean or Median
To preserve the rows containing nulls, replace the null values with the mean or median across all values for that feature.
Here I have used the median for number_of_bedrooms and the mean for price (540429, rounded to the nearest pound), giving these updated rows:
location | date_of_sale | number_of_bedrooms | price | type |
Ashford | 05/08/2017 | 2 | 699000 | semi-detached |
Stratford-on-Avon | 29/03/2012 | 3 | 540429 | detached |
Rugby | 01/03/2003 | 2 | 247000 | detached |
Hampstead | 05/03/2016 | 2 | 540429 | terraced |
Here is the Python code to impute nulls with the mean:
mean = df["price"].mean()
df["price"] = df["price"].fillna(value=mean)
And here is the equivalent code using the median:
median = df["price"].median()
df["price"] = df["price"].fillna(value=median)
Impute Nulls With the Group Mean or Median
Replacing nulls with the mean or median across all values for that feature may be too crude; a better approach would be to use the mean or median based on groups within the data.
For example, rather than using the mean price, use the mean price based on the location, the number of bedrooms, and year of sale. This would give a more accurate estimate of the imputed prices.
Here's how to do that in Python:
# Extract the year of sale
df['year'] = df.date_of_sale.dt.year
# Impute null prices with the mean for the same location, bedrooms, and year of sale
df['price'] = df.groupby(['location', 'number_of_bedrooms', 'year'])['price'].transform(lambda x: x.fillna(x.mean()))
Note that this has no effect on our tiny dataset, as no (location, number_of_bedrooms, year) group contains enough rows to compute a mean for its missing prices!
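A common follow-up in this situation (a sketch, not part of the original lesson) is to fall back to the overall mean for anything the group-based impute couldn't fill:
# Fill any prices still missing after the group impute with the column-wide mean
df["price"] = df["price"].fillna(df["price"].mean())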
Data Cleansing in Python
Take a look at these screencasts for a walk-through of these techniques. You can find the code on the course GitHub repository.
Part 1: Processing the Main Features
Part 2: Processing Outliers and Duplicates
Part 3: Processing Nulls
Recap
Data is messy, and data cleansing will almost always be a significant part of your machine learning adventures! We looked at some techniques you could use:
Convert data types
Replace values
Remove outliers
Replace outlier values
Remove duplicates
Remove columns containing nulls
Remove rows containing nulls
Impute nulls with the mean or median
Impute nulls with the group mean or median
In the next chapter, we will create new features from existing features. This is called feature engineering!