Last updated on 8/5/21

Apply Your Feature Engineering Skills to the Titanic Dataset

Evaluated skills

  • Prepare data with feature engineering techniques

Description

In this exercise, you will analyze the Titanic dataset from Kaggle.

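The table below describes the features in the data. Note that in Kaggle's train.csv the column names are capitalized differently from this data dictionary (for example Survived, Pclass, SibSp, and Fare):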

Feature   | Definition                                  | Key
survival  | Survival                                    | 0 = No, 1 = Yes
pclass    | Ticket class                                | 1 = 1st, 2 = 2nd, 3 = 3rd
sex       | Sex                                         |
Age       | Age in years                                |
sibsp     | # of siblings / spouses aboard the Titanic  |
parch     | # of parents / children aboard the Titanic  |
ticket    | Ticket number                               |
fare      | Passenger fare                              |
cabin     | Cabin number                                |
embarked  | Port of Embarkation                         | C = Cherbourg, Q = Queenstown, S = Southampton

A copy of the dataset is available in the course GitHub repository as titanic.csv. If you download the data from Kaggle instead, use the file called train.csv. Let's start by loading the dataset and previewing the first few rows:

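import pandas as pd

# Load the dataset (use "train.csv" if you downloaded the file from Kaggle)
df = pd.read_csv("titanic.csv")
# Preview the first five rows
df.head()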

  • Question 1

    Use the isnull() function to find the columns containing nulls. Which feature contains the most nulls?

    • Age

    • Cabin

    • Fare

    • Embarked

  • Question 2

    The Pclass feature is the ticket class. Use the unique() and value_counts() functions to inspect its distinct values and how often each occurs. What would be a good strategy for processing it?

    • It's a continuous-valued feature, so keep it as an integer.

    • It's a categorical feature, so convert to text such as Class1, Class2, and Class3.

    • It's a categorical feature, so use one-hot encoding to convert to dummy variables.

    • It has too few distinct values and is therefore of little value. Delete it.

  • Question 3

    Use binning to split the Fare feature into four bands based on the quartile boundaries. Call the bands Q1, Q2, Q3, and Q4. Use the describe() function to determine the quartile boundaries, then use cut() to create the bins. Finally, group on the new binned category and find the range of values in each band.

    What are the minimum and maximum Fare values in band Q1?

    • 0 and 7.8958

    • 0 and 7.9104

    • 4.0125 and 7.8958

    • 0 and 7.9250
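
If you are unsure how the calls named in the questions fit together, the following is a minimal sketch on a small made-up DataFrame. It assumes the capitalized column names from train.csv, and the values are hypothetical, so it shows the mechanics only and does not reproduce the Titanic results. The one-hot encoding step illustrates one of the options listed in Question 2 rather than settling the answer, and the FareBand column name is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the Titanic dataset; only the column names mirror train.csv.
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 3, 2, 3],
    "Age": [22.0, np.nan, 35.0, 4.0, np.nan, 58.0, 27.0, 40.0],
    "Fare": [80.0, 7.0, 13.0, 8.0, 52.0, 0.0, 26.0, 10.0],
    "Cabin": ["C85", None, None, None, "E46", None, None, None],
})

# Question 1 mechanics: count the nulls in each column, largest count first.
null_counts = toy.isnull().sum().sort_values(ascending=False)
print(null_counts)

# Question 2 mechanics: list the distinct values and how often each occurs.
pclass_values = sorted(toy["Pclass"].unique())
pclass_counts = toy["Pclass"].value_counts()
print(pclass_values)
print(pclass_counts)

# One-hot encoding, shown only to illustrate what dummy variables look like.
pclass_dummies = pd.get_dummies(toy["Pclass"], prefix="Pclass")
print(pclass_dummies.head())

# Question 3 mechanics: read the quartile boundaries from describe() ...
stats = toy["Fare"].describe()
edges = [stats["min"], stats["25%"], stats["50%"], stats["75%"], stats["max"]]

# ... then bin with cut(). Intervals are right-closed by default, so
# include_lowest=True is needed to keep the minimum value inside the first band.
toy["FareBand"] = pd.cut(
    toy["Fare"],
    bins=edges,
    labels=["Q1", "Q2", "Q3", "Q4"],
    include_lowest=True,
)

# Group on the new band and report the range of Fare values in each one.
band_ranges = toy.groupby("FareBand", observed=True)["Fare"].agg(["min", "max"])
print(band_ranges)
```

To apply the same steps to the exercise, replace toy with the df loaded from titanic.csv. Note that cut() requires the bin edges to be strictly increasing, so this approach assumes the quartile boundaries of the column are all distinct.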