Last updated on 8/5/21
Build a Regression Model With the House Price Dataset
- Build a supervised learning model to address a regression task
In this quiz, you are going to build models to predict house prices in King County, U.S., based on characteristics of the home. The original dataset can be found here: https://www.kaggle.com/harlfoxem/housesalesprediction/download.
This is a classic dataset with the following variables:
- Square footage of different parts of the house: sqft_living, sqft_lot, sqft_above, sqft_basement
- Square footage of the neighboring houses (in 2015): sqft_living15, sqft_lot15
- Number of bedrooms, bathrooms and floors
- Waterfront or not
- Location variables: zipcode, lat, long
- Date of sale (date), year built, and year renovated: yr_built, yr_renovated
- Quality: condition, grade, and how many times it was viewed: view
The target variable is the price at which the house was sold.
Start by importing the libraries and loading the dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('kc_house_data.csv')
In the following exercises, you are not going to work with the date or location columns, so let's drop them from the DataFrame:
df.drop(columns = ['id', 'zipcode', 'lat', 'long', 'date'], inplace = True)
You are left with the following 15 predictors and one target variable:
df.columns
Index(['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'sqft_living15',
       'sqft_lot15'],
      dtype='object')
At this point, you may want to take some time to familiarize yourself with the dataset through visualization and simple descriptive statistics.
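As a minimal sketch of that first look (using a tiny synthetic frame as a stand-in for the real DataFrame), you might run:

```python
import pandas as pd

# Tiny synthetic stand-in for the real King County DataFrame (illustration only)
df = pd.DataFrame({
    'price':       [221900, 538000, 180000, 604000, 510000],
    'sqft_living': [1180, 2570, 770, 1960, 1680],
    'bedrooms':    [3, 3, 2, 4, 3],
})

print(df.describe())        # per-column summary statistics
print(df.corr()['price'])   # correlation of each column with price
# With the real data you might also look at df['price'].hist()
# and a seaborn correlation heatmap: sns.heatmap(df.corr())
```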
To build a predictive model, you usually split the dataset into a training set and a testing set. The surface-related variables (sqft_living, sqft_lot, etc.) and the ordinal variables (floors, bathrooms, etc.) are on very different scales, so you would need to scale the data before training a linear or KNN model. How should you apply the scaler? You call:
- fit_transform() on the whole dataset before splitting into a training and a testing subset
- fit_transform() on the training subset and apply the scaler to the testing subset with transform()
- fit_transform() on the training subset and on the testing subset independently
- fit_transform() on the training subset and inverse_transform() on the testing subset
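Whatever the answer, the scale gap itself is easy to verify. A quick sketch with made-up numbers standing in for the real columns:

```python
import pandas as pd

# Synthetic stand-in illustrating the scale gap between surface and ordinal variables
df = pd.DataFrame({
    'sqft_lot':    [5650, 7242, 10000, 5000],
    'sqft_living': [1180, 2570, 770, 1960],
    'floors':      [1.0, 2.0, 1.0, 1.0],
    'bathrooms':   [1.0, 2.25, 1.0, 3.0],
})

# Ranges differ by orders of magnitude; distances in KNN and
# raw coefficient sizes are dominated by the large-scale columns
ranges = df.max() - df.min()
print(ranges)
```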
Looking for outliers in the dataset, you notice that one house has 33 bedrooms, although the living space is only sqft_living = 1620 square feet. That's a lot of bedrooms for an average-sized house.
Similarly, you notice that 10 houses have no bathrooms, including two that sold for over a million dollars. That seems odd, to say the least. What is a good strategy to handle these house samples?
(Hint: Look at how each strategy impacts the R^2 of a simple linear regression model, for the different scenarios below)
- Drop the 11 samples as outliers.
- Replace the value of 33 bedrooms with a more sensible value (3, the mean number of bedrooms, etc.), and replace the 0 bathrooms with 1 bathroom in all 10 samples.
- Keep the data as is, because 11 odd samples out of 21,613 won't make much of a difference anyway.
- All of the above.
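Before picking a strategy, it helps to inspect the flagged rows directly. A sketch, with synthetic rows standing in for the real data:

```python
import pandas as pd

# Synthetic stand-in; the real df comes from the Kaggle CSV
df = pd.DataFrame({
    'bedrooms':    [3, 33, 2, 4, 3],
    'bathrooms':   [1.0, 1.75, 0.0, 2.5, 0.0],
    'price':       [221900, 640000, 1200000, 604000, 355000],
    'sqft_living': [1180, 1620, 770, 1960, 1430],
})

# Flag the suspicious samples: 33 bedrooms, or no bathroom at all
odd = df[(df['bedrooms'] == 33) | (df['bathrooms'] == 0)]
print(odd)
```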
Looking at the correlation between the predictors and the target variable, you see that sqft_living and grade are the most correlated variables.
Build the following three linear regression models and compare their R^2 and model coefficients.
- M1: price ~ sqft_living
- M2: price ~ grade
- M3: price ~ sqft_living + grade
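The comparison above can be sketched with scikit-learn, here on synthetic data in which grade is built to correlate with sqft_living (the real dataset would be used in practice):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data mimicking two highly correlated predictors (illustration only)
rng = np.random.default_rng(0)
n = 200
sqft = rng.uniform(500, 4000, n)
grade = np.clip(np.round(sqft / 500 + rng.normal(0, 1, n)), 1, 13)
price = 100 * sqft + 30000 * grade + rng.normal(0, 50000, n)
df = pd.DataFrame({'price': price, 'sqft_living': sqft, 'grade': grade})

results = {}
for name, cols in [('M1', ['sqft_living']), ('M2', ['grade']),
                   ('M3', ['sqft_living', 'grade'])]:
    model = LinearRegression().fit(df[cols], df['price'])
    results[name] = model.score(df[cols], df['price'])  # training R^2
    print(name, 'R^2 =', round(results[name], 3), 'coefficients =', model.coef_)
```

One fact worth remembering when reading the output: on training data, adding a predictor to a linear model with an intercept can never lower R^2.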
Which of the following assertions are true? Careful, there are several correct answers.
- The coefficients of M3 are smaller than the coefficients of M1 and M2 because the two predictors are highly correlated.
- The R^2 for M3 is incorrect because the two predictors are highly correlated. It should be much higher.
- The M3 model is worse than M1 and M2 because its R^2 is higher.
- M3 and M1 are both better than M2 in terms of R^2.