Test Your Knowledge About Predictive Models!
Evaluated skills
- Build Predictive Models
Description
We continue working with the Titanic dataset and consider the following logistic regression model:
M = 'survived ~ pclass + sex + sibsp + parch + fare + C(embarked) '
where:
- survived: the binary target variable indicating whether the passenger survived or not
- embarked: the port of embarkation (wrapped in C() to treat it as categorical; see the sketch after this list)
- pclass: the travel class
- sex: the sex of the passenger
- sibsp: the number of siblings and spouses aboard
- parch: the number of parents and children aboard
- fare: the price of the ticket
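Since embarked enters the formula through C(), here is a minimal sketch (using patsy, which statsmodels relies on for formulas; the toy frame below is purely illustrative) of how the categorical term expands into dummy columns:
import pandas as pd
from patsy import dmatrix
# a tiny illustrative frame mimicking the Titanic 'embarked' codes (C, Q, S)
toy = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S']})
# C(embarked) one-hot encodes the variable, dropping one level as the reference
design = dmatrix('C(embarked)', toy, return_type='dataframe')
print(design.columns.tolist())
# e.g. ['Intercept', 'C(embarked)[T.Q]', 'C(embarked)[T.S]']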
Let's import the packages and load the dataset:
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
df = pd.read_csv('titanic.csv')
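Before going further, a quick look at what was loaded can help (a minimal sketch; the exact shape depends on the CSV version you use):
print(df.shape)
print(df.head())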
Next, we define the model and drop the rows with missing values for the variables used in the model.
M = 'survived ~ pclass + sex + sibsp + parch + fare + C(embarked)'
columns = ['survived', 'pclass', 'sex', 'sibsp', 'parch', 'fare', 'embarked']
df = df.dropna(subset=columns)
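As a quick sanity check (a minimal sketch; the remaining row count depends on your CSV version), you can verify that no missing values are left in the model columns:
print(df[columns].isna().sum())
print("remaining rows:", len(df))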
Question 1
Notice that for 4 passengers, the fare variable is much higher (512) than for the rest of the passengers (average fare = 33.2):
print("mean fare: {:.2f}".format(df.fare.mean()))
print(df[df.fare > 512])
Let's assume these high fares are not outliers or mistakes and that these values should be kept as relevant values.
As usual, we split the dataset of 1309 samples into train and test subsets.
What happens if these 4 samples are only present in the test subset?
Careful, there are several correct answers.
- If the fare variable is not an important variable, the model performance won't be affected much.
- The model will assume that the data is capped at the maximum present in the test set.
- The model will not have encountered fares like that in the training set, which may affect its performance.
- It's better not to include the fare variable in the model definition.
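To see the issue concretely, here is a minimal sketch of a 70/30 random split (the fraction, the seed, and the 512 threshold are illustrative assumptions) that checks where the 4 high-fare passengers end up:
np.random.seed(1)
train_index = df.sample(frac=0.7).index
train = df.loc[df.index.isin(train_index)]
test = df.loc[~df.index.isin(train_index)]
# count how many of the high-fare passengers fall in each subset
print("high fares in train:", (train.fare > 512).sum())
print("high fares in test:", (test.fare > 512).sum())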
Question 2
Run 3 experiments, each time splitting the dataset into train and test sets with a different seed for the split.
For instance, take these values:
seeds = [1,8,17]
Which is the best split in terms of AUC and Accuracy?
seeds = [1, 8, 17]
for seed in seeds:
    # create the train and test subsets
    np.random.seed(seed)
    train_index = df.sample(frac=0.7).index
    train = df.loc[df.index.isin(train_index)]
    test = df.loc[~df.index.isin(train_index)]
    # train the model
    results = smf.logit(M, train).fit()
    yhat = results.predict(test)
    # AUC score
    auc = roc_auc_score(test['survived'], yhat)
    print("seed: {} AUC: {:.2f}".format(seed, auc))
- seed 1
- seed 8
- seed 17
- Impossible to tell
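The loop above only reports AUC; to also compare Accuracy, a short addition inside the loop could look like this (a sketch assuming the usual 0.5 probability threshold):
from sklearn.metrics import accuracy_score
# turn the predicted probabilities into class labels and score them
acc = accuracy_score(test['survived'], (yhat > 0.5).astype(int))
print("seed: {} accuracy: {:.2f}".format(seed, acc))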
Question 3
Consider the age variable. Notice that there are 263 missing values:
df[df.age.isna()].shape
What would be a good strategy to deal with these missing values and still add the age variable to the model?
M = 'survived ~ age + pclass + sex + sibsp + parch + fare + C(embarked) '
Careful, there are several correct answers.
- Input random numbers for the missing values
- Replace the missing age data with the average age computed over the non-missing values
- Replace the missing values with the average age of the survivors
- Replace the missing age data with separate averages for males and females
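As an illustration of how one of the proposed strategies could be coded, here is a minimal sketch of sex-specific average imputation (M_age and results_age are hypothetical names, and fitting on the full dataset is only for demonstration, not an endorsement of this answer):
# fill missing ages with the mean age of the corresponding sex group
df['age'] = df.groupby('sex')['age'].transform(lambda s: s.fillna(s.mean()))
M_age = 'survived ~ age + pclass + sex + sibsp + parch + fare + C(embarked)'
results_age = smf.logit(M_age, df).fit()
print(results_age.summary())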