Last updated on 8/5/21

Build a Classification Model With the Titanic Dataset

Evaluated skills

  • Build a supervised learning model to address a classification task

Description

In this exercise, you will build a classification model using the Titanic dataset from Kaggle. We used this dataset in the feature engineering exercise in Part 2.

https://www.kaggle.com/c/titanic

The following is a description of the features in the data:

Feature  | Definition                                 | Key
survival | Survival                                   | 0 = No, 1 = Yes
pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd
sex      | Sex                                        |
Age      | Age in years                               |
sibsp    | # of siblings / spouses aboard the Titanic |
parch    | # of parents / children aboard the Titanic |
ticket   | Ticket number                              |
fare     | Passenger fare                             |
cabin    | Cabin number                               |
embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton

You can find the code and data for this activity on the course GitHub repository.

The file titanic_clean.csv contains data that has already been cleaned up as follows:

  • Nulls in Age have been imputed with the mean age.
  • The first letter of the cabin has been split to provide a new deck feature. This has then been one-hot encoded, with nulls going to a column Deck_nan.
  • Sex has been one-hot encoded.
  • Embarked has been one-hot encoded, with nulls going to a column Embarked_nan.
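
For reference, the cleaning steps listed above could be reproduced roughly as follows. This is only a sketch on a tiny made-up sample, not the script that produced titanic_clean.csv. The raw column names (Age, Cabin, Sex, Embarked), the intermediate Deck column, and the resulting Sex_female / Sex_male names are assumptions.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample (assumption); the real data comes from the Kaggle files.
raw = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": ["C85", np.nan, "E46"],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", np.nan],
})

clean = raw.copy()

# Impute missing ages with the mean of the known ages.
clean["Age"] = clean["Age"].fillna(clean["Age"].mean())

# Keep only the first letter of the cabin as a deck feature (nulls stay null).
clean["Deck"] = clean["Cabin"].str[0]
clean = clean.drop(columns="Cabin")

# One-hot encode Deck and Embarked; dummy_na=True adds a *_nan column for nulls.
clean = pd.get_dummies(clean, columns=["Deck", "Embarked"], dummy_na=True)

# One-hot encode Sex (no nulls expected here, so no extra column).
clean = pd.get_dummies(clean, columns=["Sex"])

print(clean.columns.tolist())
```

Passing dummy_na=True is what routes missing values into their own indicator column instead of leaving a row of all zeros.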

The Jupyter Notebook classification_activity.ipynb contains a template for the code you will run. Open the template in Jupyter and complete the code by following the guidance in the template. The objective is to predict the survival of passengers based on the available features.
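
As a rough orientation, a classification workflow of this kind often follows the steps sketched below: separate the features from the target, split into training and test sets, scale, fit a model, and evaluate. The sketch is not the template's code. It uses made-up random data in place of titanic_clean.csv, and both the choice of LogisticRegression and the column subset are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up stand-in data (assumption). In the activity you would load the real
# file instead, for example: df = pd.read_csv("titanic_clean.csv")
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Age": rng.uniform(1, 80, size=n),
    "Fare": rng.uniform(5, 500, size=n),
    "Sex_female": rng.integers(0, 2, size=n),
})
# Synthetic target tied to one feature, with 20% label noise, so the model
# has some signal to learn. This is not real passenger data.
df["Survived"] = ((df["Sex_female"] == 1) ^ (rng.random(n) < 0.2)).astype(int)

# Separate the features from the target.
X = df.drop(columns="Survived")
y = df["Survived"]

# Hold out a test set for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit a simple classifier and measure accuracy on the held-out data.
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
print(f"Test accuracy: {accuracy:.2f}")
```

Fitting the scaler on the training set alone keeps information from the test set out of the preprocessing step.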

Then answer the following questions.

  • Question 1

    In the correlation visualization, select the two features below that have the strongest correlation with the target feature, Survived.

    Careful, there are several correct answers.
    • Sex

    • Age

    • Pclass

    • Sibsp

  • Question 2

    Which feature should be selected for the target?

    • Fare

    • Survived

    • Age

    • Sex

  • Question 3

    After scaling with the MinMaxScaler, which of the following are correct statements about the data?

    Careful, there are several correct answers.
    • All features have a mean of 0.5.

    • All features have a min of 0.

    • All features have a max of 1.

    • The features are sorted from lowest to highest importance.
