In the previous chapter, we manually built a decision tree to classify countries into low and high happiness based on life expectancy and unemployment. We provided pre-classified samples to the model (that was you!), which were used to infer the rules for the decision tree.
In this chapter, we will repeat the same tasks using Python. You may want to check back to that exercise to compare what we did manually against what we do in code.
Start by importing the necessary libraries, including pandas, NumPy, and Matplotlib, to give you data manipulation and visualization capabilities. Then you'll import a few capabilities from scikit-learn (also called sklearn), which is the Python machine learning library we will be using.
Finally, you'll import a module called functions.py, which I have created for this course. You can find all the code for this course on the GitHub repository.
# Import Python libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the Python machine learning libraries we need
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Import some convenience functions. These can be found on the course GitHub
from functions import *
1. Define the Task
Remember the task:
Use life expectancy and long-term unemployment rate to predict the perceived happiness (low or high) of inhabitants of a country.
2. Acquire Clean Data
Let’s load the data from the CSV file using pandas:
# Load the data set
dataset = pd.read_csv("world_data_really_tiny.csv")
3. Understand the Data
i. Inspect the Data
Use the head() function to show the first 12 rows (which is the entire dataset). Again, this is far too small for any real machine learning activity, but it serves our purpose in this learning exercise.
# Inspect first few rows
dataset.head(12)
Look at the Data Shape
Confirm the number of rows and columns in the data:
# Inspect data shape
dataset.shape
Compute Descriptive Statistics
Computing the descriptive stats can be done in one function call:
# Inspect descriptive stats
dataset.describe()
ii. Visualize the Data
Use the histPlotAll() function from functions.py to plot a histogram for each numeric feature:
# View univariate histogram plots
histPlotAll(dataset)
Use the boxPlotAll() function from functions.py to plot a box plot for each numeric feature:
# View univariate box plots
boxPlotAll(dataset)
Use the classComparePlot() function from functions.py to plot a comparative histogram for the two classes:
# View class split
classComparePlot(dataset[["happiness","lifeexp","unemployment"]], 'happiness', plotType='hist')
4. Prepare the Data for Supervised Machine Learning
i. Select Features and Split Into Input and Target Features
We can select happiness as the feature to predict ( y ) and lifeexp and unemployment as the features to make the prediction ( X ):
# Split into input and output features
y = dataset["happiness"]
X = dataset[["lifeexp","unemployment"]]
X.head()
5. Build a Model
i. Split Into Training and Test Sets
Use the train_test_split() function in sklearn to split the sample set into a training set, which we will use to train the model, and a test set, which we will use to evaluate the model:
# Split into test and training sets
test_size = 0.33
seed = 7
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
Note that I have requested ⅓ of the data be held back as the test set. I have also set the random_state parameter to a seed of seven. This requests a random sampling of the data but ensures we can always get the same random sampling if we rerun the experiment. When we tweak the model, it ensures that any changes are due to the tweaking, and not to the random sampling being different. You can change the random sample by changing the seed to another integer value.
Let's look at the four samples produced and confirm the randomness of the selection:
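The inspection code itself isn't reproduced in this excerpt. A minimal sketch, using a toy stand-in for the chapter's 12-row dataset (the values below are illustrative, not from the real CSV):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the chapter's 12-row dataset (illustrative values only)
X = pd.DataFrame({"lifeexp": [81.4, 74.1, 79.0, 70.2, 76.5, 82.0,
                              68.9, 77.3, 80.1, 72.8, 75.9, 83.2],
                  "unemployment": [1.2, 6.0, 4.0, 8.1, 2.5, 1.0,
                                   9.3, 3.1, 1.8, 7.2, 5.0, 0.9]})
y = pd.Series(["High", "Low", "High", "Low", "High", "High",
               "Low", "High", "High", "Low", "Low", "High"], name="happiness")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)

# The shuffled row indices show that the split is a random sample
print(X_train.index.tolist())
print(X_test.index.tolist())
```

With test_size=0.33, four of the twelve rows are held back for testing, and the scrambled index order confirms the rows were sampled at random rather than taken in file order.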
ii. Select an Algorithm
Create a model using the sklearn decision tree algorithm:
# Select algorithm
model = DecisionTreeClassifier()
iii. Fit the Model to the Data
Now take the training set and use it to fit the model (i.e., train the model):
# Fit model to the data
model.fit(X_train, y_train)
iv. Check the Model
Next, assess how well the model predicts happiness using the training data, by “pouring” training set X into the decision tree:
# Check model performance on training data
predictions = model.predict(X_train)
print(accuracy_score(y_train, predictions))
The model has performed really well on the training data: 100% accuracy!
6. Evaluate the Model
i. Compute Accuracy Score
Let’s pour test set X into the decision tree and see what it predicts:
# Evaluate the model on the test data
predictions = model.predict(X_test)
Look at the predictions it has made:
array(['Low', 'High', 'High', 'Low'], dtype=object)
And compute the accuracy score:
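The computation is the same accuracy_score() call used on the training data, now comparing y_test against the test-set predictions. A self-contained sketch (the labels below are illustrative, not the chapter's actual results):

```python
from sklearn.metrics import accuracy_score

# Illustrative test labels and model predictions
y_test = ["Low", "High", "High", "Low"]
predictions = ["Low", "High", "High", "High"]

# Accuracy is the fraction of predictions that match the actual labels
print(accuracy_score(y_test, predictions))  # 0.75: three of four correct
```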
We can show the model predictions alongside the original data and the actual happiness values:
df = X_test.copy()
df['Actual'] = y_test
df['Prediction'] = predictions
df
What Rules Did Sklearn Come Up With?
At this point, the model produced by sklearn is a bit of a black box. There is no set of rules to examine, but it is possible to inspect the model and visualize the rules.
For this code to work, you need to first install Graphviz by running the following from a terminal session:
conda install python-graphviz
You can then run the following code, which uses a function from functions.py:
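The functions.py helper used here isn't reproduced in this excerpt, so its name and signature aren't shown. As a stand-in, scikit-learn's own export_graphviz() produces the same kind of visualization; a sketch with toy data (illustrative values, not the real CSV):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Toy stand-in for the chapter's training data (illustrative values only)
X_train = pd.DataFrame({"lifeexp": [81.0, 74.0, 79.0, 70.0],
                        "unemployment": [1.2, 6.0, 4.0, 8.0]})
y_train = ["High", "Low", "High", "Low"]

model = DecisionTreeClassifier().fit(X_train, y_train)

# Export the fitted tree in Graphviz DOT format; render it with
# graphviz.Source(dot) in a notebook, or save it and run the dot tool
dot = export_graphviz(model, out_file=None,
                      feature_names=list(X_train.columns),
                      class_names=list(model.classes_),
                      filled=True)
print(dot.splitlines()[0])
```

Each node in the rendered tree shows the split condition, the Gini value, and the class assigned at that point.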
And see the decision tree rules:
You can see that the first rule checks whether unemployment is below or equal to 1.48, and if so, classifies the sample point as high. The next rule checks whether lifeexp is below or equal to 75.35, and if so, classifies the sample point as low. The third and final rule checks whether unemployment is below or equal to 5.525, and if so, classifies it as high. Any other sample points get classified as low.
There are probably a couple of questions in your mind:
What is Gini?
How did sklearn come up with the rules and specifically the boundary values like 5.525?
We will examine the construction of decision trees and answer these questions in the next chapter.
You built your first machine learning model! It’s not a great one, mainly because the dataset was so small. You’ve learned the process for building such models, and how a simple algorithm - decision tree - can be used to perform a classification task.
In the next part, we will build something more solid by exploring some other techniques. But for now, it's time to test your knowledge!