• 12 hours
  • Medium


Last updated on 6/23/22

Test Your Knowledge About Predictive Models!

Evaluated skills

  • Build Predictive Models

Description

We continue working with the Titanic dataset and consider the following logistic regression model:

M = 'survived ~ pclass + sex + sibsp + parch + fare + C(embarked) '

where:

  • survived: the binary target variable indicating whether the passenger survived
  • embarked: the port of embarkation
  • pclass: the travel class
  • sex: sex of the passenger
  • sibsp: number of siblings and spouses aboard
  • parch: number of parents and children aboard
  • fare: the price of the ticket
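
Written out, a formula like M corresponds to a linear model on the log-odds of survival. The following is a sketch under assumptions: the coefficient names β are notation introduced here, and the reference levels (female for sex, C for embarked) assume the default treatment coding of the formula interface, alphabetically ordered levels, and the port codes C, Q and S.

```latex
\log\frac{p}{1-p}
  = \beta_0
  + \beta_1\,\text{pclass}
  + \beta_2\,\mathbb{1}[\text{sex}=\text{male}]
  + \beta_3\,\text{sibsp}
  + \beta_4\,\text{parch}
  + \beta_5\,\text{fare}
  + \beta_6\,\mathbb{1}[\text{embarked}=\text{Q}]
  + \beta_7\,\mathbb{1}[\text{embarked}=\text{S}],
\qquad p = P(\text{survived}=1)
```

Each categorical variable contributes one indicator per level except the reference level, which is absorbed into the intercept.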

Let's import the packages and load the dataset:

import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
df = pd.read_csv('titanic.csv')

Next, we define the model and drop the rows that have missing values in any of the model's variables.

M = 'survived ~ pclass + sex + sibsp + parch + fare + C(embarked) '
columns = ['survived', 'pclass','sex','sibsp','parch','fare','embarked']
df = df.dropna(subset = columns)
  • Question 1

    Notice that for 4 passengers, the fare variable is much higher (512)
    than for the rest of the passengers (average fare = 33.2).

    print("mean fare: {:.2f}".format(df.fare.mean()))
    print(df[df.fare > 512])

    Let's assume these high fares are not outliers or mistakes and that these values should be kept as relevant values.

    As usual, we split the dataset of 1309 samples into train and test subsets.

    What happens if these 4 samples are only present in the test subset?

    Careful, there are several correct answers.
    • If the fare variable is not an important variable, the model performance won't be affected much.

    • The model will assume that the data is capped at the maximum present in the test set.

       

    • The model will not have encountered such high fares in the training set,
      which may affect its performance.

    • It's better not to include the fare variable in the model definition
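
To inspect the situation described in this question on a given split, one can look for test rows whose value falls outside the range seen during training. The snippet below is a minimal sketch on synthetic data, not the real Titanic fares; the helper name out_of_range_rows and the toy DataFrame are hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the fares (NOT the real data):
# 96 ordinary fares below 100 plus 4 extreme fares at 512.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {"fare": np.concatenate([rng.uniform(5, 100, size=96), np.full(4, 512.0)])}
)

def out_of_range_rows(train, test, column):
    """Return the test rows whose `column` value lies outside the range seen in train."""
    low, high = train[column].min(), train[column].max()
    return test[(test[column] < low) | (test[column] > high)]

# Force the scenario of the question: all 4 extreme fares land in the test subset.
test = pd.concat([toy[toy.fare > 500], toy[toy.fare <= 500].iloc[:26]])
train = toy.drop(test.index)

unseen = out_of_range_rows(train, test, "fare")
print("max fare in train: {:.2f}".format(train.fare.max()))
print("test rows outside the training range: {}".format(len(unseen)))
```

The same helper could be applied to the real train and test subsets with column "fare"; whether such rows hurt the predictions depends on the model and on how much weight it gives to that variable.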

  • Question 2

    Run 3 experiments, each time splitting the dataset into a train and a test set with a different seed for the split.

    For instance, take these values: seeds = [1,8,17]
    Which is the best split in terms of AUC and accuracy?

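    seeds = [1,8,17]
    for seed in seeds:
        # create the train and test subsets (70% / 30%)
        np.random.seed(seed)
        train_index = df.sample(frac = 0.7).index
        train = df.loc[df.index.isin(train_index)]
        test = df.loc[~df.index.isin(train_index)]
        # train the model on the train subset
        results = smf.logit(M, train).fit()
        # predicted survival probabilities for the test subset
        yhat = results.predict(test)
        # AUC score
        auc = roc_auc_score(test['survived'], yhat)
        # accuracy, assuming a 0.5 decision threshold on the probabilities
        acc = ((yhat > 0.5).astype(int) == test['survived']).mean()
        print(" seed: {} AUC: {:.2f} accuracy: {:.2f} ".format(seed, auc, acc))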
    • seed 1

    • seed 8

    • seed 17

    • Impossible to tell
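
As a variation on the split used in this question, the seed can be passed directly to DataFrame.sample through its random_state parameter instead of seeding NumPy globally. This is a minimal sketch on synthetic data; the helpers split_train_test and accuracy are hypothetical names, the 0.5 threshold is an assumption, and the resulting splits are not guaranteed to match those produced by the code above.

```python
import numpy as np
import pandas as pd

def split_train_test(data, seed, frac=0.7):
    """Split `data` into train/test subsets; the same seed always gives the same split.

    Assumes the index of `data` is unique.
    """
    train = data.sample(frac=frac, random_state=seed)
    test = data.drop(train.index)
    return train, test

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of correct predictions after thresholding probabilities (0.5 is an assumption)."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return float((y_pred == np.asarray(y_true)).mean())

# Synthetic stand-in data, only to exercise the helpers.
toy = pd.DataFrame({"survived": [0, 1] * 50, "fare": np.arange(100.0)})

for seed in [1, 8, 17]:
    train, test = split_train_test(toy, seed)
    print("seed: {} train: {} test: {}".format(seed, len(train), len(test)))

# Accuracy of four toy predictions against their true labels.
print(accuracy([1, 0, 1, 1], [0.9, 0.2, 0.4, 0.7]))
```

Keeping the seed local to the split avoids depending on whatever else has consumed the global random state before the call.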

  • Question 3

    Consider the age variable. Notice that there are 263 missing values: df[df.age.isna()].shape

    What would be a good strategy to deal with these missing values and still add the age variable to the model?

    M = 'survived ~ age + pclass + sex + sibsp + parch + fare + C(embarked) '

     

    Careful, there are several correct answers.
    • replace the missing age values with random numbers

    • replace the missing age values with the average age of the non-missing values

    • replace the missing age values with the average age of the survivors

    • replace the missing age values with the average age of males and females, respectively
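
Two of the mean-based options listed above can be sketched with pandas as follows. This is an illustration on synthetic rows, not the real Titanic data, and the column names age_overall and age_by_sex are hypothetical; it shows the mechanics only and does not indicate which options are correct.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in rows (NOT the real Titanic data).
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "age": [20.0, 40.0, np.nan, 30.0, np.nan, 50.0],
})

# Fill with the overall mean of the observed ages (NaN is skipped by mean()).
overall_mean = toy["age"].mean()
toy["age_overall"] = toy["age"].fillna(overall_mean)

# Fill with the mean age of the passenger's own group (here: sex).
group_mean = toy.groupby("sex")["age"].transform("mean")
toy["age_by_sex"] = toy["age"].fillna(group_mean)

print(toy)
```

In a train/test workflow, the means would typically be computed on the training subset only and then applied to both subsets.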
