Data is Messy!
Estimates suggest that data scientists spend around 60-80% of their time cleaning and preparing data. This is not an overestimate! Your project data is rarely ready to be loaded straight into your machine learning models.
There may be numerous challenges, such as:
Missing data
Poorly formatted data
Misspellings
Duplicates
Errors
Outliers
Dealing with these sorts of challenges is called data cleansing. We will review these concepts in this chapter, but for more information, feel free to check out a previous course called Perform an Initial Data Analysis. In the next chapter, we will take a look at feature engineering, where you augment data with new features.
Data cleansing deals with the problem of dirty data, and it's nearly always dirty! You need to spend a good amount of time cleaning it up because it makes machine learning activities more difficult or impossible, and can lead to poor or misleading results! :(
Cleaning a Small Dataset
As a refresher, let's walk through the cleanup of a very small dataset.
id | location | date_of_sale | number_of_bedrooms | price | type |
0 | Clapham | 12/04/1999 | 1 | £729,000 | apartment |
1 | Ashford | 05/08/2017 | unknown | £699,000 | semi-detached |
2 | Stratford-on-Avon | 29/03/2012 | 3 | | detached |
3 | Canterbury | 01/07/2009 | 2 | £529,000 | teraced |
4 | Camden | 16/12/2001 | 1 | £616,000 | apartment |
5 | Rugby | 01/03/2003 | - | £247,000 | detached |
6 | Hampstead | 05/03/2016 | 2 | £0 | terraced |
7 | Clapham | 05/07/2001 | 363 | £543,000 | apartment |
8 | Stratford-on-Avon | 10th May 2010 | 3 | £420,000 | detached |
9 | Camden | 16/12/2001 | 1 | £616,000 | apartment |
How many data issues can you spot?
Formatting Errors
The date_of_sale in this row is in a different format than the other dates:
id | location | date_of_sale | number_of_bedrooms | price | type |
8 | Stratford-on-Avon | 10th May 2010 | 3 | £420,000 | detached |
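One way to spot rows that deviate from the expected format is to parse the column strictly and see what fails. Here is a minimal sketch, assuming dd/mm/yyyy is the expected format:
import pandas as pd

# Dates that don't match dd/mm/yyyy are coerced to NaT, flagging the odd rows out
parsed = pd.to_datetime(df["date_of_sale"], format="%d/%m/%Y", errors="coerce")
print(df[parsed.isna()])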
Strings in Numeric Fields
These rows have strings in the number_of_bedrooms feature:
id | location | date_of_sale | number_of_bedrooms | price | type |
1 | Ashford | 05/08/2017 | unknown | £699,000 | semi-detached |
5 | Rugby | 01/03/2003 | - | £247,000 | detached |
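A quick way to surface these rows (a sketch, not part of the original lesson) is to attempt a numeric conversion and inspect what fails:
import pandas as pd

# Values that can't be parsed as numbers become NaN, flagging the bad rows
bedrooms = pd.to_numeric(df["number_of_bedrooms"], errors="coerce")
print(df[bedrooms.isna()])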
Outliers
The following row has an extreme (and unbelievable) value for number_of_bedrooms:
id | location | date_of_sale | number_of_bedrooms | price | type |
7 | Clapham | 05/07/2001 | 363 | £543,000 | apartment |
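Summary statistics are often the quickest way to expose values like this; a minimal sketch, assuming number_of_bedrooms has already been converted to a numeric type:
# An impossible maximum (363 bedrooms!) stands out immediately
print(df["number_of_bedrooms"].describe())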
Missing Values
The following row has a missing price:
id | location | date_of_sale | number_of_bedrooms | price | type |
2 | Stratford-on-Avon | 29/03/2012 | 3 | | detached |
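Assuming pandas loaded the empty cell as NaN, you can pull out such rows directly:
# Show every row with a missing price
print(df[df["price"].isna()])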
Misspellings
The following row has a misspelling in the type column:
id | location | date_of_sale | number_of_bedrooms | price | type |
3 | Canterbury | 01/07/2009 | 2 | £529,000 | teraced |
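Counting the distinct values in a categorical column makes one-off misspellings like this easy to spot:
# 'teraced' will show up as a rare category alongside 'terraced'
print(df["type"].value_counts())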
Duplicates
The following rows look like they could be duplicates (although it's possible that two identical apartments were sold together on the same day!):
id | location | date_of_sale | number_of_bedrooms | price | type |
4 | Camden | 16/12/2001 | 1 | £616,000 | apartment |
9 | Camden | 16/12/2001 | 1 | £616,000 | apartment |
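Assuming the id values form the DataFrame's index (which the drop and loc calls later in this chapter also assume), you can flag every copy of a duplicated row:
# keep=False marks every copy of a duplicate, not just the second and later ones
print(df[df.duplicated(keep=False)])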
Zero for Nulls
The following row has a zero for price, which is probably a case of a zero being entered for an unknown price:
id | location | date_of_sale | number_of_bedrooms | price | type |
6 | Hampstead | 05/03/2016 | 2 | £0 | terraced |
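Once price has been converted to a numeric type (see the cleaning steps below), here is a minimal sketch for treating these zeros as unknown values:
import numpy as np

# A zero price almost certainly means 'unknown', so mark it as missing
df["price"] = df["price"].replace(0, np.nan)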
Data Cleansing Strategies
Convert Data Types
Ensure that features are treated with the correct data type:
Dates
Integers
Floats
Objects
In our example, you would process the data so that the following data types are applied:
location - object
date_of_sale - datetime
number_of_bedrooms - int
price - int
type - object
Here is the Python code to achieve this (note that the numeric conversions will only succeed once the replacement steps described below have been applied, since strings like "unknown" and "£729,000" can't be parsed directly):
import pandas as pd

# dayfirst=True parses the UK-style dates; format="mixed" (pandas 2.0+) also accepts "10th May 2010"
df["date_of_sale"] = pd.to_datetime(df["date_of_sale"], dayfirst=True, format="mixed")
df["number_of_bedrooms"] = pd.to_numeric(df["number_of_bedrooms"])
df["price"] = pd.to_numeric(df["price"])
Replace Values
Replace incorrect, invalid, or troublesome values.
Replace these strings in number_of_bedrooms:
id | location | date_of_sale | number_of_bedrooms | price | type |
1 | Ashford | 05/08/2017 | unknown | £699,000 | semi-detached |
5 | Rugby | 01/03/2003 | - | £247,000 | detached |
with nulls:
id | location | date_of_sale | number_of_bedrooms | price | type |
1 | Ashford | 05/08/2017 | NaN | £699,000 | semi-detached |
5 | Rugby | 01/03/2003 | NaN | £247,000 | detached |
Here is the Python code to achieve this:
import numpy as np

# Find the non-numeric strings (e.g. "unknown", "-") and replace them with NaN
non_nums = df[~df["number_of_bedrooms"].str.isnumeric()]["number_of_bedrooms"].unique()
df["number_of_bedrooms"] = df["number_of_bedrooms"].replace(non_nums, np.nan)
Replace the misspelling:
id | location | date_of_sale | number_of_bedrooms | price | type |
3 | Canterbury | 01/07/2009 | 2 | £529,000 | teraced |
with the correct spelling:
id | location | date_of_sale | number_of_bedrooms | price | type |
3 | Canterbury | 01/07/2009 | 2 | £529,000 | terraced |
Here is the Python code to achieve this:
df["type"] = df["type"].replace(['teraced'], 'terraced')
Remove all the £ and comma symbols in the price column:
id | price | cleaned_price |
0 | £729,000 | 729000 |
1 | £699,000 | 699000 |
3 | £529,000 | 529000 |
4 | £616,000 | 616000 |
5 | £247,000 | 247000 |
7 | £543,000 | 543000 |
8 | £420,000 | 420000 |
9 | £616,000 | 616000 |
Here is the Python code to achieve this:
df["price"] = df["price"].apply(lambda x: x.replace('£', '') if type(x) is str else x)
df["price"] = df["price"].apply(lambda x: x.replace(',', '') if type(x) is str else x)
Remove Outliers
If you believe an outlier to be erroneous or troublesome, you can remove it.
Consider removing this row:
id | location | date_of_sale | number_of_bedrooms | price | type |
7 | Clapham | 05/07/2001 | 363 | 543000 | apartment |
Here's how to do this in Python:
# Drop the row whose index label is 7
df = df.drop(7)
Replace Outlier Values
Instead of removing outliers, consider replacing them with the correct values (if you can determine them) or imputing estimated values (see handling missing values later).
You can replace the number_of_bedrooms with the average number of bedrooms for a £543,000 apartment in Clapham. This could be calculated from the given data or selected from an external dataset.
id | location | date_of_sale | number_of_bedrooms | price | type |
7 | Clapham | 05/07/2001 | 363 | £543,000 | apartment |
After replacement, the above becomes:
id | location | date_of_sale | number_of_bedrooms | price | type |
7 | Clapham | 05/07/2001 | 1 | £543,000 | apartment |
Here's one way to do this in Python:
import numpy as np

# Mark the outlier as missing, then fill it with the mean for its location
df.loc[7, 'number_of_bedrooms'] = np.nan
df['number_of_bedrooms'] = df.groupby('location')['number_of_bedrooms'].transform(lambda x: x.fillna(x.mean()))
Remove Duplicates
Remove any true duplicates.
Remove one of these rows:
id | location | date_of_sale | number_of_bedrooms | price | type |
4 | Camden | 16/12/2001 | 1 | £616,000 | apartment |
9 | Camden | 16/12/2001 | 1 | £616,000 | apartment |
Here's the Python code that removes all duplicate rows:
# Drop exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates()
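Note that if id were a regular column rather than the index, rows 4 and 9 would differ in their id values and drop_duplicates() would keep both. In that case, a hedged variation is to compare on every column except id:
# Ignore the id column when deciding what counts as a duplicate
df = df.drop_duplicates(subset=df.columns.difference(["id"]))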
Handle Missing Values
Having applied the above techniques, we will have a dataset containing nulls. Most machine learning algorithms can't process data containing nulls, so we need to handle them.
After applying other cleaning techniques, we may have the following:
id | location | date_of_sale | number_of_bedrooms | price | type |
0 | Clapham | 12/04/1999 | 1 | 729000 | apartment |
1 | Ashford | 05/08/2017 | NaN | 699000 | semi-detached |
2 | Stratford-on-Avon | 29/03/2012 | 3 | NaN | detached |
3 | Canterbury | 01/07/2009 | 2 | 529000 | terraced |
4 | Camden | 16/12/2001 | 1 | 616000 | apartment |
5 | Rugby | 01/03/2003 | NaN | 247000 | detached |
6 | Hampstead | 05/03/2016 | 2 | NaN | terraced |
7 | Clapham | 05/07/2001 | 1 | 543000 | apartment |
8 | Stratford-on-Avon | 10/05/2010 | 3 | 420000 | detached |
There are nulls in the number_of_bedrooms and price columns.
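You can confirm which columns contain nulls, and how many, with a quick count:
# Count the nulls in each column
print(df.isna().sum())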
We could apply one of the following strategies:
Remove Columns Containing Nulls
If a column containing nulls is not important to your analysis, you can remove it entirely.
This is what it would look like:
location | date_of_sale | type |
Clapham | 12/04/1999 | apartment |
Ashford | 05/08/2017 | semi-detached |
Stratford-on-Avon | 29/03/2012 | detached |
Canterbury | 01/07/2009 | terraced |
Camden | 16/12/2001 | apartment |
Rugby | 01/03/2003 | detached |
Hampstead | 05/03/2016 | terraced |
Clapham | 05/07/2001 | apartment |
Stratford-on-Avon | 10/05/2010 | detached |
In this case, number_of_bedrooms and price are of such importance that we would not want to remove them.
Here is the Python code to achieve this:
# Drop every column that contains at least one null
df = df.dropna(axis=1)
Remove Rows Containing Nulls
Losing a few rows that contain nulls might be acceptable, but if many rows contain nulls, this strategy could cost you a lot of data!
Applying this strategy to the data would look like:
location | date_of_sale | number_of_bedrooms | price | type |
Clapham | 12/04/1999 | 1 | 729000 | apartment |
Canterbury | 01/07/2009 | 2 | 529000 | terraced |
Camden | 16/12/2001 | 1 | 616000 | apartment |
Clapham | 05/07/2001 | 1 | 543000 | apartment |
Stratford-on-Avon | 10/05/2010 | 3 | 420000 | detached |
Here is the Python code to achieve this:
# Drop every row that contains at least one null
df = df.dropna()
Impute Nulls With the Mean or Median
To preserve the rows containing nulls, replace the null values with the mean or median across all values for that feature.
Here I have used the median for number_of_bedrooms and the mean for price (540429, rounded to the nearest pound), giving these updated rows:
location | date_of_sale | number_of_bedrooms | price | type |
Ashford | 05/08/2017 | 2 | 699000 | semi-detached |
Stratford-on-Avon | 29/03/2012 | 3 | 540429 | detached |
Rugby | 01/03/2003 | 2 | 247000 | detached |
Hampstead | 05/03/2016 | 2 | 540429 | terraced |
Here is the Python code to impute nulls with the mean:
mean = df["price"].mean()
df["price"] = df["price"].fillna(value=mean)
And here is the equivalent code using the median:
median = df["price"].median()
df["price"] = df["price"].fillna(value=median)
Impute Nulls With the Group Mean or Median
Replacing nulls with the mean or median across all values for that feature may be too crude; a better approach would be to use the mean or median based on groups within the data.
For example, rather than using the mean price, use the mean price based on the location, the number of bedrooms, and year of sale. This would give a more accurate estimate of the imputed prices.
Here's how to do that in Python:
# Extract the year of sale
df['year'] = df.date_of_sale.dt.year
# Impute null prices with the mean for the same location, bedrooms, and year of sale
df['price'] = df.groupby(['location', 'number_of_bedrooms', 'year'])['price'].transform(lambda x: x.fillna(x.mean()))
Note that this has no effect on our tiny dataset, as no (location, number_of_bedrooms, year) group contains enough rows to compute a mean for its missing prices!
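A common follow-up in this situation (a sketch, not part of the original lesson) is to fall back to the overall mean for anything the group-based impute couldn't fill:
# Fill any prices still missing after the group impute with the column-wide mean
df["price"] = df["price"].fillna(df["price"].mean())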
Data Cleansing in Python
Take a look at these screencasts for a walk-through of these techniques. You can find the code on the course GitHub repository.
Part 1: Processing the Main Features
Part 2: Processing Outliers and Duplicates
Part 3: Processing Nulls
Recap
Data is messy, and data cleansing will almost always be a significant part of your machine learning adventures! We looked at some techniques you could use:
Convert data types
Replace values
Remove outliers
Replace outlier values
Remove duplicates
Remove columns containing nulls
Remove rows containing nulls
Impute nulls with the mean or median
Impute nulls with the group mean or median
In the next chapter, we will create new features from existing features. This is called feature engineering!