• 8 hours
  • Medium


Last updated on 8/5/21

Build a Regression Model With the House Price Dataset


Evaluated skills

  • Build a supervised learning model to address a regression task


In this quiz, you are going to build models to predict house prices in King County, U.S., based on characteristics of the home. The original dataset can be found here: https://www.kaggle.com/harlfoxem/housesalesprediction/download.

This is a classic dataset with the following variables:

  • Square footage of different parts of the house: sqft_living, sqft_lot, sqft_above, sqft_basement
  • Square footage of neighboring houses (as of 2015): sqft_living15, sqft_lot15
  • Number of bedrooms, bathrooms, and floors
  • Waterfront or not
  • Location variables: zipcode, lat, long
  • Date of sale, year built, and renovated: yr_built, yr_renovated
  • Quality: condition, grade, and how many times it was viewed: view

The target variable is the price at which the house was sold.

Start by importing the libraries and loading the dataset:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('kc_house_data.csv')  # file name as provided by Kaggle

In the following exercises, you are not going to work with the date or location columns, so let's drop them from the DataFrame:

df.drop(columns=['id', 'zipcode', 'lat', 'long', 'date'], inplace=True)

You are left with the following 15 predictors and one target variable:

Index(['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'sqft_living15',
       'sqft_lot15'],
      dtype='object')
At this point, you may want to take some time to familiarize yourself with the dataset through visualization and simple descriptive statistics.
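For instance, df.describe() and the column-wise correlations with price are quick first looks. A minimal sketch, using a small made-up DataFrame in place of the real King County data (the values below are illustrative, not from the dataset):

```python
import pandas as pd

# Toy stand-in for the King County data (illustrative values only)
df = pd.DataFrame({
    "price": [221900, 538000, 180000, 604000],
    "bedrooms": [3, 3, 2, 4],
    "sqft_living": [1180, 2570, 770, 1960],
})

# Descriptive statistics for every numeric column
print(df.describe())

# Correlation of each predictor with the target
print(df.corr()["price"].sort_values(ascending=False))
```

With the real data, a histogram of price (e.g. sns.histplot) and a correlation heatmap (sns.heatmap(df.corr())) are also worth a look.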

  • Question 1

    To build a predictive model, you usually split the dataset into a training set and a testing set. The surface-related variables (sqft_living, sqft_lot, etc.) and the ordinal variables (floors, bathrooms, etc.) are on very different scales, so you need to scale the data before you can train a linear or KNN model.

    What is the proper way to scale the variables using a scikit-learn scaler such as MinMaxScaler or StandardScaler?

    • fit_transform() on the whole dataset before splitting into a training and a testing subset

    • fit_transform() on the training subset and apply the scaler on the testing subset with transform()

    • fit_transform() on the training subset and on the testing subset independently

    • fit_transform() on the training subset and inverse_transform() on the testing dataset
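As a reminder of the API involved, scikit-learn scalers expose fit(), transform(), and fit_transform(). A minimal sketch of the split-then-scale workflow, with made-up numbers standing in for the real columns:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Illustrative values only, standing in for (sqft_living, sqft_lot) and price
X = np.array([[1180, 5650], [2570, 7242], [770, 10000], [1960, 5000]], dtype=float)
y = np.array([221900, 538000, 180000, 604000], dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn the min/max statistics
X_test_scaled = scaler.transform(X_test)        # reuse the fitted statistics
```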

  • Question 2

    Looking for outliers in the dataset, you notice that one house has 33 bedrooms, although the living space is only sqft_living = 1620 square feet. That's a lot of bedrooms for an average-sized house.

    Similarly, you notice that 10 houses have no bathrooms, including two that sold for over a million dollars. That seems odd, to say the least. What is a good strategy to handle these house samples?

    (Hint: Look at how each strategy below impacts the R^2 of a simple linear regression model.)

    • Drop the 11 samples as outliers.

    • Replace the value of 33 bedrooms with a more sensible value (3, the mean number of bedrooms, etc.), and replace 0 bathrooms with 1 bathroom in all 10 samples.

    • Keep the data as it is because 11 weird samples out of 21613 won't make much of a difference anyway.

    • All of the above.
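In pandas, both the "drop" and the "replace" strategies are one-liners with boolean indexing and .loc. A minimal sketch on a toy DataFrame (illustrative values, not the real samples):

```python
import pandas as pd

# Toy frame standing in for the suspicious King County samples
df = pd.DataFrame({
    "bedrooms": [3, 33, 2],
    "bathrooms": [1.0, 1.75, 0.0],
    "price": [221900, 640000, 1200000],
})

# Strategy 1: drop the suspicious rows as outliers
dropped = df[(df["bedrooms"] != 33) & (df["bathrooms"] != 0)]

# Strategy 2: replace implausible values with more sensible ones
fixed = df.copy()
fixed.loc[fixed["bedrooms"] == 33, "bedrooms"] = 3
fixed.loc[fixed["bathrooms"] == 0, "bathrooms"] = 1
```

Refitting the model on each variant and comparing the R^2 scores tells you how much these samples actually matter.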

  • Question 3

    Looking at the correlation between the predictors and the target variable, you see that sqft_living and grade are the most correlated variables.

    Build the following three linear regression models and compare their R^2 and model coefficients.

    • M1: price ~ sqft_living
    • M2: price ~ grade
    • M3: price ~ sqft_living + grade
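The three models can be fit with scikit-learn's LinearRegression. The sketch below uses synthetic data in which the two predictors are strongly correlated, mimicking the sqft_living/grade relationship (the coefficients and noise levels are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins: grade is built to correlate strongly with sqft_living
sqft_living = rng.uniform(500, 4000, size=200)
grade = sqft_living / 500 + rng.normal(0, 0.5, size=200)
price = 200 * sqft_living + 30000 * grade + rng.normal(0, 50000, size=200)

X1 = sqft_living.reshape(-1, 1)            # M1: price ~ sqft_living
X2 = grade.reshape(-1, 1)                  # M2: price ~ grade
X3 = np.column_stack([sqft_living, grade]) # M3: price ~ sqft_living + grade

m1 = LinearRegression().fit(X1, price)
m2 = LinearRegression().fit(X2, price)
m3 = LinearRegression().fit(X3, price)

r2_1, r2_2, r2_3 = m1.score(X1, price), m2.score(X2, price), m3.score(X3, price)
print("R2:", r2_1, r2_2, r2_3)
print("Coefficients:", m1.coef_, m2.coef_, m3.coef_)
```

Comparing the printed R^2 values and coefficients across the three models is exactly the exercise the question asks for.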

    Which of the following assertions are true?

    Careful, there are several correct answers.
    • The coefficients of M3 are smaller than the coefficients of M1 and M2 because the two predictors are highly correlated.

    • The R^2 for M3 is incorrect because the two predictors are highly correlated. It should be much higher.

    • The M3 model is worse than M1 and M2 because its R^2 is higher.

    • M3 and M1 are both better than M2 in terms of R^2.