Last updated on 8/5/21

# Build a Regression Model With the House Price Dataset

### Description

In this quiz, you are going to build models to predict house prices in King County, U.S., based on characteristics of the home. The original dataset can be found here: https://www.kaggle.com/harlfoxem/housesalesprediction/download.

This is a classic dataset with the following variables:

• Square footage of different parts of the house: sqft_living, sqft_lot, sqft_above, sqft_basement
• Houses in the neighborhood (in 2015): sqft_living15, sqft_lot15
• Number of bedrooms, bathrooms and floors
• Waterfront or not
• Location variables: zipcode, lat, long
• Date of sale, year built, and year renovated: date, yr_built, yr_renovated
• Quality: condition, grade, and how many times it was viewed: view

The target variable is the price at which the house was sold.

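``````
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Kaggle CSV; the local path is an assumption, adjust it to where you saved the file
df = pd.read_csv('kc_house_data.csv')
``````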

In the following exercises, you are not going to work with the id, date, or location columns, so let's drop them from the DataFrame:

``````
df.drop(columns=['id', 'zipcode', 'lat', 'long', 'date'], inplace=True)
``````

You are left with the following 15 predictors and one target variable:

``````
df.columns
Index(['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'sqft_living15',
       'sqft_lot15'],
      dtype='object')
``````

At this point, you may want to take some time to familiarize yourself with the dataset through visualization and simple descriptive statistics.
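As a starting point, here is a minimal sketch of the kind of descriptive statistics you might look at. The `toy` DataFrame holds made-up values and only stands in for `df` so the snippet runs on its own; with the real data, replace `toy` with `df`.

```python
import pandas as pd

# Made-up stand-in for the house DataFrame (NOT real rows from the dataset).
toy = pd.DataFrame({
    "price": [300000, 450000, 700000, 250000],
    "bedrooms": [3, 4, 5, 2],
    "bathrooms": [1.0, 2.5, 3.0, 1.0],
    "sqft_living": [1500, 2400, 3100, 900],
})

# One row per column, showing the range and the mean of each variable.
summary = toy.describe().T[["min", "mean", "max"]]
print(summary)

# Frequency of each value of a discrete variable, sorted by value.
bedroom_counts = toy["bedrooms"].value_counts().sort_index()
print(bedroom_counts)
```

Looking at the min and max of each column is often enough to spot values that deserve a closer look.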

### Question 1

To build a predictive model, you usually split the dataset into a training set and a testing set. The surface-related variables (sqft_living, sqft_lot, etc.) and the ordinal variables (floors, bathrooms, etc.) are on very different scales, so you need to scale the data before you can train a linear or KNN model.
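The split itself can be sketched as follows. This is only an illustration: the `toy` values are made up, and `test_size=0.25` and `random_state=42` are arbitrary choices, not settings prescribed by the quiz.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the house DataFrame (NOT real rows from the dataset).
toy = pd.DataFrame({
    "sqft_living": [1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500],
    "floors": [1.0, 1.0, 2.0, 1.5, 2.0, 1.0, 3.0, 2.0],
    "price": [200000, 310000, 390000, 520000, 600000, 680000, 810000, 900000],
})

X = toy.drop(columns="price")   # predictors
y = toy["price"]                # target

# Hold out 25% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)

# The spread of each predictor shows how different the scales are.
print(X.max() - X.min())
```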

What is the proper way to scale the variables using a scikit-learn scaler such as MinMaxScaler or StandardScaler?

• fit_transform() on the whole dataset before splitting into a training and a testing subset

• fit_transform() on the training subset and apply the scaler on the testing subset with transform()

• fit_transform() on the training subset and on the testing subset independently

• fit_transform() on the training subset and inverse_transform() on the testing subset
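For reference, the scaler methods named in the options behave as follows on a toy array. This sketch only shows what each call computes; the arrays `a` and `b` are made-up values, not part of the house dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two made-up single-column arrays.
a = np.array([[0.0], [5.0], [10.0]])
b = np.array([[2.0], [20.0]])

scaler = MinMaxScaler()

# fit_transform() learns min=0 and max=10 from `a`, then rescales `a` to [0, 1].
a_scaled = scaler.fit_transform(a)            # [0.0, 0.5, 1.0]

# transform() reuses the min and max learned from `a`; nothing is re-learned from `b`.
b_scaled = scaler.transform(b)                # [0.2, 2.0]

# inverse_transform() maps scaled values back to the original units.
b_back = scaler.inverse_transform(b_scaled)   # [2.0, 20.0]

print(scaler.data_min_, scaler.data_max_)
print(a_scaled.ravel(), b_scaled.ravel(), b_back.ravel())
```

Note that calling fit_transform() again on another array would overwrite the learned minimum and maximum with those of the new array.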

### Question 2

Looking for outliers in the dataset, you notice that one house has 33 bedrooms, although the living space is only sqft_living = 1620 square feet. That's a lot of bedrooms for an average-sized house.

Similarly, you notice that 10 houses have no bathrooms, including two that sold for over a million dollars. That seems odd, to say the least. What is a good strategy to handle these house samples?

(Hint: Compare the R^2 of a simple linear regression model under each of the scenarios below.)

• Drop the 11 samples as outliers.

• Replace the value of 33 bedrooms by a more sensible value (3, mean of bedrooms, etc.), and replace the 0 bathrooms value with 1 bathroom in all 10 samples.

• Keep the data as it is because 11 weird samples out of 21613 won't make much of a difference anyway.

• All of the above.
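One way to follow the hint is sketched below. Several things here are assumptions made for illustration: the helper names `r2_for` and `compare_strategies`, the choice of predictors, the replacement values (3 bedrooms, 1 bathroom), and the use of in-sample R^2. The `toy` DataFrame is synthetic; with the real data, call `compare_strategies(df)` instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def r2_for(data, predictors=("bedrooms", "bathrooms", "sqft_living"), target="price"):
    """Fit a plain linear regression and return its in-sample R^2."""
    X, y = data[list(predictors)], data[target]
    return LinearRegression().fit(X, y).score(X, y)

def compare_strategies(data):
    """Return the R^2 obtained when keeping, dropping, or replacing the odd samples."""
    odd = (data["bedrooms"] == 33) | (data["bathrooms"] == 0)
    dropped = data[~odd]
    replaced = data.copy()
    replaced.loc[replaced["bedrooms"] == 33, "bedrooms"] = 3
    replaced.loc[replaced["bathrooms"] == 0, "bathrooms"] = 1
    return {
        "keep": r2_for(data),
        "drop": r2_for(dropped),
        "replace": r2_for(replaced),
    }

# Synthetic stand-in for the house data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "bedrooms": rng.integers(1, 6, n),
    "bathrooms": rng.integers(1, 4, n).astype(float),
    "sqft_living": rng.integers(600, 4000, n),
})
toy["price"] = 200 * toy["sqft_living"] + rng.normal(0, 50_000, n)
toy.loc[0, "bedrooms"] = 33    # plant one odd bedroom count
toy.loc[1, "bathrooms"] = 0    # plant one house without bathrooms

results = compare_strategies(toy)
print(results)
```

The numbers printed for the synthetic data say nothing about the real dataset; only the comparison procedure carries over.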

### Question 3

Looking at the correlation between the predictors and the target variable, you see that sqft_living and grade are the variables most strongly correlated with price.
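A ranking like this can be obtained with a few lines of pandas. The sketch below uses made-up values in `toy` so that it runs on its own; with the real data, replace `toy` with `df`. It ranks by the signed Pearson correlation, which is an assumption about what "most correlated" means here.

```python
import pandas as pd

# Made-up stand-in for the house DataFrame (NOT real rows from the dataset).
toy = pd.DataFrame({
    "price": [200000, 310000, 390000, 520000, 600000],
    "sqft_living": [1000, 1500, 2000, 2500, 3000],
    "yr_built": [1990, 1950, 2005, 1970, 1985],
})

# Pearson correlation of every numeric predictor with the target, strongest first.
ranked = (
    toy.select_dtypes("number")
    .corr()["price"]
    .drop("price")
    .sort_values(ascending=False)
)
print(ranked)
```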

Build the following three linear regression models and compare their R^2 and model coefficients.

• M1: price ~ sqft_living