Last updated on 6/23/22

### Evaluated skills

• Build Predictive Models

### Description

We continue working with the Titanic dataset and consider the logistic regression model:

`M = 'survived ~ pclass + sex + sibsp + parch + fare + C(embarked) '`

where:

• survived: the binary target variable, whether the passenger survived or not
• embarked: the port of embarkation
• pclass: the travel class
• sex: the sex of the passenger
• sibsp: the number of siblings/spouses aboard
• parch: the number of parents/children aboard
• fare: the price of the ticket

Let's import the packages (the dataset is assumed to be already loaded into a DataFrame `df`):

```
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
import numpy as np
import pandas as pd

from sklearn.metrics import roc_auc_score
```

Next, we define the model formula and drop the rows with missing values in any of the model's variables.

```
M = 'survived ~ pclass + sex + sibsp + parch + fare + C(embarked)'
columns = ['survived', 'pclass', 'sex', 'sibsp', 'parch', 'fare', 'embarked']
df = df.dropna(subset=columns)
```
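As an aside on the formula: `C(embarked)` tells the formula interface to treat the port of embarkation as categorical and dummy-encode it. A rough equivalent of that encoding with pandas, on a small synthetic column (the `toy` frame below is made up for illustration):

```python
import pandas as pd

# Synthetic stand-in for the embarked column (ports C, Q, S)
toy = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

# One-hot encode, dropping the first level so the remaining
# columns are interpreted relative to a reference category,
# much like the formula's categorical treatment
dummies = pd.get_dummies(toy["embarked"], prefix="embarked", drop_first=True)
print(dummies.columns.tolist())  # ['embarked_Q', 'embarked_S']
```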
### Question 1

Notice that for 4 passengers, the fare variable is much higher (512)
than for the rest of the passengers (average fare = 33.2).

```
print("mean fare: {:.2f}".format(df.fare.mean()))
print(df[df.fare > 512])
```

Let's assume these high fares are not outliers or mistakes and that these values should be kept as relevant values.

As usual, we split the dataset of 1309 samples into a train and a test subset.

What happens if these 4 samples are only present in the test subset?

Careful, there are several correct answers.
• If the fare variable is not an important variable, the model performance won't be affected much.

• The model will assume that the data is capped at the maximum present in the test set.

• The model will not have encountered fares like that in the training set, which may affect its performance.

• It's better not to include the fare variable in the model definition.
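The effect of out-of-range values can be illustrated on purely synthetic data (not the Titanic set): a logistic model fit only on small fares still returns a probability for an extreme fare, but only by extrapolating its linear log-odds far beyond anything seen in training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Train only on "normal" fares in [0, 60); the true relationship
# here is a sigmoid centered at 30 (synthetic, for illustration)
x_train = rng.uniform(0, 60, size=500).reshape(-1, 1)
y_train = (rng.random(500) < 1 / (1 + np.exp(-(x_train.ravel() - 30) / 10))).astype(int)

model = LogisticRegression().fit(x_train, y_train)

# A fare of 512 was never seen in training: the model extrapolates
# its linear log-odds, producing an extreme probability
p = model.predict_proba([[512.0]])[0, 1]
print(f"P(survived | fare=512) = {p:.3f}")
```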

### Question 2

Run 3 experiments, each time splitting the dataset into a train and a test set, but with a different seed for the split.

For instance, take these values: `seeds = [1, 8, 17]`.
Which is the best split in terms of AUC and accuracy?

```
from sklearn.metrics import accuracy_score

seeds = [1, 8, 17]

for seed in seeds:
    # create the train and test subsets (70/30 split)
    np.random.seed(seed)
    train_index = df.sample(frac=0.7).index
    train = df.loc[df.index.isin(train_index)]
    test = df.loc[~df.index.isin(train_index)]
    # train the model
    results = smf.logit(M, train).fit()
    yhat = results.predict(test)
    # AUC and accuracy scores
    auc = roc_auc_score(test['survived'], yhat)
    acc = accuracy_score(test['survived'], yhat > 0.5)
    print("seed: {} AUC: {:.2f} accuracy: {:.2f}".format(seed, auc, acc))
```
• seed 1

• seed 8

• seed 17

• Impossible to tell
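The point of the question can also be seen on synthetic data (the values below are made up, not the Titanic set): the same model scored on splits made with different seeds yields different AUCs, and the spread comes from the randomness of the split, not from the model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic binary classification problem: only the first feature matters
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(scale=1.5, size=500) > 0).astype(int)

aucs = []
for seed in [1, 8, 17]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    p = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    aucs.append(roc_auc_score(y_te, p))
    print(f"seed {seed}: AUC {aucs[-1]:.3f}")

# The spread across seeds reflects split randomness, not model quality
print(f"AUC range: {max(aucs) - min(aucs):.3f}")
```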

### Question 3

Consider the age variable. Notice that there are 263 missing values: `df[df.age.isna()].shape`

What would be a good strategy to deal with these missing values while still adding the age variable to the model?

`M = 'survived ~ age + pclass + sex + sibsp + parch + fare + C(embarked) '`

Careful, there are several correct answers.
• impute random numbers for the missing values

• replace the missing ages with the average age over the non-missing values

• replace the missing values with the average age of the survivors

• replace the missing ages with separate averages for males and females
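As a sketch of the last strategy, here is one way to do group-wise mean imputation with pandas on a small synthetic DataFrame (the `toy` frame and its values are illustrative, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with some missing ages
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male"],
    "age": [40.0, 30.0, np.nan, np.nan, 20.0],
})

# Fill each missing age with the mean age of the passenger's sex group:
# groupby + transform returns a Series aligned to the original index
toy["age"] = toy["age"].fillna(toy.groupby("sex")["age"].transform("mean"))
print(toy["age"].tolist())  # [40.0, 30.0, 30.0, 30.0, 20.0]
```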