Now that you have an understanding of a couple of regression algorithms (namely, linear regression and k-nearest neighbors), let's build some regression models in Python.
We will follow the same process used previously for our classification tasks:
We will use the same world indicators dataset, so refer back to Part 3, Chapter 3, for a reminder of the origins and description of this data.
Import the Necessary Libraries
Let's start by importing the necessary libraries - pandas, NumPy, and matplotlib, as well as some sklearn functions. We will also import a small library of convenience functions used specifically for this course. Feel free to open up functions.py to see what's inside.
# Core librariesimport pandas as pdimport numpy as npimport matplotlib.pyplot as plt# Sklearn processingfrom sklearn.preprocessing import MinMaxScalerfrom sklearn.model_selection import train_test_split# Sklearn regression algorithmsfrom sklearn.linear_model import LinearRegressionfrom sklearn.neighbors import KNeighborsRegressorfrom sklearn.tree import DecisionTreeRegressor# Sklearn regression model evaluation functionfrom sklearn.metrics import mean_absolute_error# Convenience functions. This can be found on the course githubfrom functions import *
Define the Task
The task is:
Make predictions about a country's life expectancy in years from a set of metrics for the country.
The only difference between this task and the one we defined for our classification task is that we are predicting the life expectancy in years, rather than as L, M, and H bands. Remember that regression is about predicting a continuous numeric feature and classification is about predicting a category.
Acquire Clean Data
This is the same data used for the classification task, so I'm going to quickly clean it up using the same code used in that task:
# Load the datadataset = pd.read_csv("world_data.csv")# Remove sparsely populated featuresdataset = dataset.drop(["murder","urbanpopulation","unemployment"], axis=1)# Impute all features with meanmeans = dataset.mean().to_dict()for m in means:dataset[m] = dataset[m].fillna(value=means[m])
Understand the Data
We spent some time in Part 3, Chapter 3, inspecting and visualizing this data. I won't repeat that here, but I will introduce some other interesting visualizations.
First, a little bonus! If you run the following, you will get a list of alternative styles for your charts:
['bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark-palette', 'seaborn-dark', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'seaborn', 'Solarize_Light2', 'tableau-colorblind10', '_classic_test']
You can then select one of the styles to make your charts look a little nicer!
You can understand the distribution of each individual feature using histograms:
Another interesting plot is the scatter matrix. This shows the correlation between pairs of features:
For example, you can look at the correlation between lifeexp and sanitation and see a positive correlation:
And lifeexp and fertility and see a negative correlation:
Again, this gives you a better feel for the data, and in particular, how the target feature, lifeexp, relates to the other features. After all, this is the relationship the machine learning algorithms will use to make their predictions!
Select Features and Split Into Input and Target Features
We will predict lifeexp, and this feature becomes our y. We will use all the other features as our inputs, X:
y = dataset["lifeexp"]X = dataset[['happiness', 'income', 'sanitation', 'water', 'literacy', 'inequality', 'energy', 'childmortality', 'fertility', 'hiv', 'foodsupply', 'population']]
As discussed in the classification exercise performed with this data, scale the data before building the model to ensure the features are presented without a scale bias to the selected algorithms.
In this case, I will use the MinMaxScaler(), which scales the data, so every feature sits in the range 0 to 1.
# Rescale the datascaler = MinMaxScaler(feature_range=(0,1))rescaledX = scaler.fit_transform(X)# Convert X back to a Pandas DataFrame, for convenienceX = pd.DataFrame(rescaledX, index=X.index, columns=X.columns)
You learned about two algorithms for regression: linear regression and k-nearest neighbors. We will build models using both of these. I will also build a model with another algorithm, support vector machine. We haven't explored this yet, but sklearn makes it easy to try new algorithms. We can try it out and explore further if it looks promising!
Split Into Test and Training Sets
We will build all three models using the same training set and evaluate them with the same test set. So, let's split into test and training sets now:
test_size = 0.33seed = 1X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
Create Multiple Models, Fit and Check Them
Let's build the models. I'll take a slightly different approach this time, to reduce the amount of coding when building and evaluating multiple models. First, I'll create a list of the untrained models. SVR is the support vector machine model (SVR stands for support vector regressor).
models = [LinearRegression(), KNeighborsRegressor(), SVR()]
Now, I will loop through all the models and fit them to the data:
for model in models:model.fit(X_train, Y_train)predictions = model.predict(X_train)print(type(model).__name__, mean_absolute_error(Y_train, predictions))
You can see the mean absolute error (MAE) for each model based on the training data. The MAE is the mean of the sum of the absolute values of all prediction errors. In other words, for each prediction, subtract the predicted value from the actual value, take the absolute value, sum up all these, and divide by the number of examined sample points:
Where y is the actual value, ŷ is the predicted value, and N is the number of sample points.
In this case, the values are in the units of the predicted feature, which is life expectancy in years. So linear regression has achieved a mean absolute error of 2.29 years. As this is on the training data, we now need to use the test data to do a proper evaluation.
Now we evaluate the models by testing with the test set:
for model in models:predictions = model.predict(X_test)print(type(model).__name__, mean_absolute_error(Y_test, predictions))
Which is the best model? The one with the lowest mean absolute error, which is linear regression. So we will choose this model:
model = models
To get a better feel for how well the model has worked, we can add the predictions and actuals back to the test data, so we can see the quality of the predictions on a country-by-country basis:
predictions = model.predict(X_test)df = X_test.copy()df['Prediction'] = predictionsdf['Actual'] = Y_testdf["Error"] = Y_test - predictionsdf
Look at the three columns on the far right of the above table. These show how close the model is to predicting the actual life expectancy for each country.
Inspect the Models
It's good to look inside the models to understand what was created. Linear regression and KNN use very different ways of building models, and you need to interpret them in different ways.
Interpreting a linear regression model is easy! Use the
linearRegressionSummary() function in
functions.py to generate a chart of the coefficients:
How do you interpret this?
For each unit increase in happiness, there is a 5.0091-year increase in lifeexp.
For each unit increase in income, there is a 9.2097- year increase in lifeexp.
For each unit increase in inequality, there is a 3.3106-year decrease in lifeexp.
Be aware of what a "unit" means in this case. Because we scaled the data using the MinMaxScaler, all features are the scaled units. So one unit of income is the difference between the max income ($120,000), and min income ($623), which is $119,377. If you do the math (119377/9.2097), you can see that this model predicts that each $12,962 increase in average income, increases life expectancy by one year.
Extract the coefficients using this code:
array([ 5.00912637, 9.2096769 , 5.41605897, 2.26122297, -4.13876626, -3.31059309, -5.53211396, -14.32217202, -1.63521978, -2.01476135, 2.22581396, 3.20586791])
Extract the intercept using this code:
Relate these directly back to the formula for a line we used in our exploration of linear regression:
In our case, y is lifeexp, m1 is the first coefficient (for happiness); m2 is the second coefficient (for income), etc.; x1, x2, etc., are the input features (happiness, income, etc.) themselves. Finally, c is the intercept.
The formula for the linear regression line computed by sklearn from this data is:
Remember that the inputs to this formula are the scaled inputs, using the MinMaxScaler.
Interpreting KNN models is a little more tricky. The problem isn't reduced to a simple formula. The algorithm performs a calculation for each individual.
We can get the distances for the k-nearest neighbors for each data point:
(array([[0. , 0.38911668, 0.3929857 , 0.4201251 , 0.51618381],
[0. , 0.20505552, 0.20891422, 0.23226123, 0.26827204],
[0. , 0.12274953, 0.1953064 , 0.22064946, 0.22207317],
[0. , 0.16349608, 0.20891422, 0.22826679, 0.28847733],
[0.13327753, 0.15176936, 0.16216559, 0.16956607, 0.18772587],
[0.10256832, 0.15520456, 0.24629633, 0.24943538, 0.25508778],
[0. , 0.3593436 , 0.38793799, 0.38953126, 0.39317685],
As you can see, there are five values for each data point because that is the default value of k for the KNeighborsRegressor.
We can also get the actual data points that are the nearest neighbors:
g = models.kneighbors_graph(X).toarray()
Inspecting the neighbors of the first data point:
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
Each 1 represents another data point that is considered a neighbor. As you can see, there are five because that is the default value of k for the KNeighborsRegressor.
Build multiple models from the same dataset, and see which one performs the best.
Measure the performance of a regression model using mean absolute error (MAE).
Iterate to improve model performance.
Inspect the coefficients for a linear regression model, and use them to understand the relevance of each input feature to the predictions of the model.
Inspect the information about the neighbors in a KNN model, but these are more difficult to interpret.
You've made it! What an excellent achievement. Before I wish you a final farewell (and hopefully, see you very soon ), I encourage you to take the very last quiz that will test your knowledge about building a regression model. You're well on your way to become a Machine Learning expert! Keep it up.