
Clean Your Data

Data is Messy!

Estimates suggest that data scientists spend around 60-80% of their time cleaning and preparing data, and that is no exaggeration! Your project data is rarely ready to be loaded straight into your machine learning models.

There may be numerous challenges, such as:

  • Missing data

  • Poorly formatted data 

  • Misspellings

  • Duplicates

  • Errors

  • Outliers

Dealing with these sorts of challenges is called data cleansing. We will review these concepts in this chapter, but for more information, feel free to check out a previous course called Perform an Initial Data Analysis. In the next chapter, we will take a look at feature engineering, where you augment data with new features.

Data cleansing deals with the problem of dirty data, and real-world data is nearly always dirty! You need to spend a good amount of time cleaning it up, because dirty data makes machine learning activities more difficult or impossible and can lead to poor or misleading results! :(

Cleaning a Small Dataset

As a refresher, let's walk through the cleanup of a very small dataset.

| id | location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- | --- |
| 0 | Clapham | 12/04/1999 | 1 | £729,000 | apartment |
| 1 | Ashford | 05/08/2017 | unknown | £699,000 | semi-detached |
| 2 | Stratford-on-Avon | 29/03/2012 | 3 |  | detached |
| 3 | Canterbury | 01/07/2009 | 2 | £529,000 | teraced |
| 4 | Camden | 16/12/2001 | 1 | £616,000 | apartment |
| 5 | Rugby | 01/03/2003 | - | £247,000 | detached |
| 6 | Hampstead | 05/03/2016 | 2 | £0 | terraced |
| 7 | Clapham | 05/07/2001 | 363 | £543,000 | apartment |
| 8 | Stratford-on-Avon | 10th May 2010 | 3 | £420,000 | detached |
| 9 | Camden | 16/12/2001 | 1 | £616,000 | apartment |

How many data issues can you spot?
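If you want to experiment as you read, here is a minimal sketch that recreates this dataset as a pandas DataFrame (the construction below is my own illustration; the official notebooks are on the course GitHub repository):

import pandas as pd
import numpy as np

# Recreate the small housing dataset, warts and all
df = pd.DataFrame({
    'location': ['Clapham', 'Ashford', 'Stratford-on-Avon', 'Canterbury', 'Camden',
                 'Rugby', 'Hampstead', 'Clapham', 'Stratford-on-Avon', 'Camden'],
    'date_of_sale': ['12/04/1999', '05/08/2017', '29/03/2012', '01/07/2009', '16/12/2001',
                     '01/03/2003', '05/03/2016', '05/07/2001', '10th May 2010', '16/12/2001'],
    'number_of_bedrooms': ['1', 'unknown', '3', '2', '1', '-', '2', '363', '3', '1'],
    'price': ['£729,000', '£699,000', np.nan, '£529,000', '£616,000',
              '£247,000', '£0', '£543,000', '£420,000', '£616,000'],
    'type': ['apartment', 'semi-detached', 'detached', 'teraced', 'apartment',
             'detached', 'terraced', 'apartment', 'detached', 'apartment'],
})
df.index.name = 'id'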

Formatting Errors

The date_of_sale in this row is in a different format than the other dates:

| id | location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- | --- |
| 8 | Stratford-on-Avon | 10th May 2010 | 3 | £420,000 | detached |

Strings in Numeric Fields

These rows have strings in the number_of_bedrooms feature:

| id | location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- | --- |
| 1 | Ashford | 05/08/2017 | unknown | £699,000 | semi-detached |
| 5 | Rugby | 01/03/2003 | - | £247,000 | detached |

Outliers

The following row has an extreme (and unbelievable) value for number_of_bedrooms:

| id | location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- | --- |
| 7 | Clapham | 05/07/2001 | 363 | £543,000 | apartment |

Missing Values

The following row has a missing price:

| id | location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- | --- |
| 2 | Stratford-on-Avon | 29/03/2012 | 3 |  | detached |

Misspellings

The following row has a misspelling in the type column:

| id | location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- | --- |
| 3 | Canterbury | 01/07/2009 | 2 | £529,000 | teraced |

Duplicates

The following rows look like they could be duplicates (although it's possible that two identical apartments were sold together on the same day!):

| id | location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- | --- |
| 4 | Camden | 16/12/2001 | 1 | £616,000 | apartment |
| 9 | Camden | 16/12/2001 | 1 | £616,000 | apartment |

Zero for Nulls

The following row has a zero for price, which is probably a case of a zero being entered for an unknown price:

| id | location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- | --- |
| 6 | Hampstead | 05/03/2016 | 2 | £0 | terraced |

Data Cleansing Strategies

Convert Data Types

Ensure that features are treated with the correct data type:

  • Dates

  • Integers

  • Floats

  • Objects

In our example, you would process the data so that the following data types were applied:

  • location - object

  • date_of_sale - datetime

  • number_of_bedrooms - int

  • price - int

  • type - object

Here is the Python code to achieve this:

df["date_of_sale"] = pd.to_datetime(df["date_of_sale"])
df['number_of_bedrooms'] = pd.to_numeric(df['number_of_bedrooms'])
df['price'] = pd.to_numeric(df['price'])
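You can verify the conversions at any point by checking the column data types:

# Display the dtype pandas has assigned to each column
print(df.dtypes)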

Replace Values

Replace incorrect, invalid, or troublesome values.

Replace these strings in number_of_bedrooms:

| id | location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- | --- |
| 1 | Ashford | 05/08/2017 | unknown | £699,000 | semi-detached |
| 5 | Rugby | 01/03/2003 | - | £247,000 | detached |

with nulls:

| id | location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- | --- |
| 1 | Ashford | 05/08/2017 | NaN | £699,000 | semi-detached |
| 5 | Rugby | 01/03/2003 | NaN | £247,000 | detached |

Here is the Python code to achieve this:

# Find the values in number_of_bedrooms that are not purely numeric (e.g. "unknown" and "-")...
non_nums = df[~df["number_of_bedrooms"].str.isnumeric()]["number_of_bedrooms"].unique()
# ...and replace them with NaN
df["number_of_bedrooms"] = df["number_of_bedrooms"].replace(non_nums, np.nan)

Replace the misspelling:

| id | location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- | --- |
| 3 | Canterbury | 01/07/2009 | 2 | £529,000 | teraced |

with the correct spelling:

| id | location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- | --- |
| 3 | Canterbury | 01/07/2009 | 2 | £529,000 | terraced |

Here is the Python code to achieve this:

df["type"] = df["type"].replace(['teraced'], 'terraced')

Remove all the £ and comma symbols in the price column:

| id | price | cleaned_price |
| --- | --- | --- |
| 0 | £729,000 | 729000 |
| 1 | £699,000 | 699000 |
| 3 | £529,000 | 529000 |
| 4 | £616,000 | 616000 |
| 5 | £247,000 | 247000 |
| 7 | £543,000 | 543000 |
| 8 | £420,000 | 420000 |
| 9 | £616,000 | 616000 |

Here is the Python code to achieve this:

df["price"] = df["price"].apply(lambda x: x.replace('£', '') if type(x) is str else x)
df["price"] = df["price"].apply(lambda x: x.replace(',', '') if type(x) is str else x)

Remove Outliers

If you believe an outlier to be erroneous or troublesome, you can remove it.

Consider removing this row:

| id | location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- | --- |
| 7 | Clapham | 05/07/2001 | 363 | 543000 | apartment |

Here's how to do this in Python:

# Drop the row whose index (id) is 7
df = df.drop(7)
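Dropping rows by index label works for a one-off fix. On a larger dataset, you would more likely remove outliers with a condition. Here is a minimal sketch, where the 20-bedroom threshold is an illustrative assumption:

# Keep rows unless number_of_bedrooms exceeds the threshold
# (using ~ with > keeps rows where the value is NaN, so missing values survive)
df = df[~(df["number_of_bedrooms"] > 20)]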

Replace Outlier Values

Instead of removing outliers, consider replacing the values with the correct ones (if you can determine them) or imputing them with estimated ones (see imputing missing values later).

You can replace the number_of_bedrooms with the average number of bedrooms for a £543,000 apartment in Clapham. This could be calculated from the given data or selected from an external dataset.

| id | location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- | --- |
| 7 | Clapham | 05/07/2001 | 363 | £543,000 | apartment |

After replacement, the above becomes:

| id | location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- | --- |
| 7 | Clapham | 05/07/2001 | 1 | £543,000 | apartment |

Here's one way to do this in Python:

# Mark the outlier value as missing...
df.loc[7,'number_of_bedrooms'] = np.nan
# ...then fill it with the mean number of bedrooms for its location
df['number_of_bedrooms'] = df.groupby('location')['number_of_bedrooms'].transform(lambda x: x.fillna(x.mean()))

Remove Duplicates

Remove any true duplicates.

Remove one of these rows:

| id | location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- | --- |
| 4 | Camden | 16/12/2001 | 1 | £616,000 | apartment |
| 9 | Camden | 16/12/2001 | 1 | £616,000 | apartment |

Here's the Python code that removes all duplicate rows:

# Remove duplicate rows (id is the index, so it is ignored by the comparison)
df = df.drop_duplicates()
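If you want to inspect the suspected duplicates before dropping anything, standard pandas can show them:

# Display every row that has at least one identical twin
print(df[df.duplicated(keep=False)])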

Handle Missing Values

Having applied the above techniques, we will have a dataset containing nulls. Most machine learning algorithms can't process data containing nulls, so we need to handle them.

After applying other cleaning techniques, we may have the following:

| id | location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- | --- |
| 0 | Clapham | 12/04/1999 | 1 | 729000 | apartment |
| 1 | Ashford | 05/08/2017 | NaN | 699000 | semi-detached |
| 2 | Stratford-on-Avon | 29/03/2012 | 3 | NaN | detached |
| 3 | Canterbury | 01/07/2009 | 2 | 529000 | terraced |
| 4 | Camden | 16/12/2001 | 1 | 616000 | apartment |
| 5 | Rugby | 01/03/2003 | NaN | 247000 | detached |
| 6 | Hampstead | 05/03/2016 | 2 | NaN | terraced |
| 7 | Clapham | 05/07/2001 | 1 | 543000 | apartment |
| 8 | Stratford-on-Avon | 10/05/2010 | 3 | 420000 | detached |

There are nulls in the number_of_bedrooms and price columns.
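You can confirm exactly where the nulls are by counting them per column:

# Count the nulls in each column
print(df.isnull().sum())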

We could apply one of the following strategies:

Remove Columns Containing Nulls

If a column containing nulls is not of particular interest, you can simply remove it. This is what that would look like:

| location | date_of_sale | type |
| --- | --- | --- |
| Clapham | 12/04/1999 | apartment |
| Ashford | 05/08/2017 | semi-detached |
| Stratford-on-Avon | 29/03/2012 | detached |
| Canterbury | 01/07/2009 | terraced |
| Camden | 16/12/2001 | apartment |
| Rugby | 01/03/2003 | detached |
| Hampstead | 05/03/2016 | terraced |
| Clapham | 05/07/2001 | apartment |
| Stratford-on-Avon | 10/05/2010 | detached |

In this case, number_of_bedrooms and price are of such importance that we would not want to remove them.

Here is the Python code to achieve this:

# Drop every column that contains at least one null
df = df.dropna(axis=1)

Remove Rows Containing Nulls

Losing a few rows that contain nulls might be acceptable, but if many rows contain nulls, this strategy can throw away a lot of data!

Applying this strategy to the data would look like:

| location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- |
| Clapham | 12/04/1999 | 1 | 729000 | apartment |
| Canterbury | 01/07/2009 | 2 | 529000 | terraced |
| Camden | 16/12/2001 | 1 | 616000 | apartment |
| Clapham | 05/07/2001 | 1 | 543000 | apartment |
| Stratford-on-Avon | 10/05/2010 | 3 | 420000 | detached |

Here is the Python code to achieve this:

# Drop every row that contains at least one null
df = df.dropna()

Impute Nulls With the Mean or Median

To preserve the rows containing nulls, replace the null values with the mean or median across all values for that feature.

Here I have used the median for number_of_bedrooms and the mean for price, giving these updated rows:

| location | date_of_sale | number_of_bedrooms | price | type |
| --- | --- | --- | --- | --- |
| Ashford | 05/08/2017 | 2 | 699000 | semi-detached |
| Stratford-on-Avon | 29/03/2012 | 3 | 540429 | detached |
| Rugby | 01/03/2003 | 2 | 247000 | detached |
| Hampstead | 05/03/2016 | 2 | 540429 | terraced |

Here is the Python code to impute nulls with the mean:

# Fill null prices with the mean price
mean = df["price"].mean()
df["price"] = df["price"].fillna(value=mean)

And the same approach with the median:

# Fill null prices with the median price
median = df["price"].median()
df["price"] = df["price"].fillna(value=median)

Impute Nulls With the Group Mean or Median

Replacing nulls with the mean or median across all values for that feature may be too crude; a better approach would be to use the mean or median based on groups within the data.

For example, rather than using the mean price, use the mean price based on the location, the number of bedrooms, and year of sale. This would give a more accurate estimate of the imputed prices.

Here's how to do that in Python:

# Extract the year of sale
df['year'] = df.date_of_sale.dt.year

# Impute null prices with the mean price for the same location, bedrooms, and year of sale
df['price'] = df.groupby(['location','number_of_bedrooms','year'])['price'].transform(lambda x: x.fillna(x.mean()))

Note that this does not affect our tiny dataset, as no (location, number of bedrooms, year) group contains enough data to compute a mean!
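When a group has no other values to average, the null survives. A common remedy, sketched below under the assumption that coarser estimates are acceptable, is to fall back to progressively broader groups and finally to the overall median:

# First try the fine-grained group mean...
df['price'] = df.groupby(['location','number_of_bedrooms','year'])['price'].transform(lambda x: x.fillna(x.mean()))
# ...then fall back to the location mean...
df['price'] = df.groupby('location')['price'].transform(lambda x: x.fillna(x.mean()))
# ...and finally fill anything still missing with the overall median
df['price'] = df['price'].fillna(df['price'].median())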

Data Cleansing in Python

Take a look at these screencasts for a walk-through of these techniques. You can find the code on the course GitHub repository.

Part 1: Processing the Main Features

Part 2: Processing Outliers and Duplicates

Part 3: Processing Nulls

Recap

Data is messy, and data cleansing will almost always be a significant part of your machine learning adventures!  We looked at some techniques you could use:

  • Convert data types

  • Replace values

  • Remove outliers

  • Replace outlier values

  • Remove duplicates

  • Remove columns containing nulls

  • Remove rows containing nulls

  • Impute nulls with the mean or median

  • Impute nulls with the group mean or median

In the next chapter, we will create new features from existing features. This is called feature engineering!
