Last updated on 8/5/21
Apply Your Feature Engineering Skills to the Titanic Dataset
- Prepare data with feature engineering techniques
In this exercise, you will analyze the Titanic dataset from Kaggle.
The following is a description of the features in the data:
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
A copy of the dataset can be found on the course GitHub repository as titanic.csv. If you download the data from Kaggle, just use the file called train.csv. Let's start by loading the dataset and taking a quick peek at the head:
import pandas as pd
df = pd.read_csv("titanic.csv")
df.head()
Use the isnull() function to find the columns containing nulls. Which feature contains the most nulls?
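As a rough illustration of the null-counting pattern, here is a minimal sketch on a small made-up frame. The `toy` DataFrame and its column names (`height`, `city`, `score`) are hypothetical and are not taken from the Titanic data; the same calls should apply to `df` once it is loaded.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to show the pattern.
toy = pd.DataFrame({
    "height": [1.7, np.nan, 1.6, 1.8],
    "city": [None, "Paris", None, None],
    "score": [10, 12, 9, 11],
})

# isnull() flags each missing cell as True; sum() then counts them per column.
null_counts = toy.isnull().sum().sort_values(ascending=False)
print(null_counts)

# idxmax() returns the label of the column with the highest null count.
most_nulls = null_counts.idxmax()
print(most_nulls)
```

Sorting the counts in descending order puts the column with the most missing values first, which makes the result easy to scan on a wider dataset.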
The Pclass feature is the ticket class. Use the value_counts() function to understand the feature. What would be a good strategy for processing it?
- It's a continuous-valued feature, so keep it as an integer.
- It's a categorical feature, so convert it to text such as Class1, Class2, and Class3.
- It's a categorical feature, so use one-hot encoding to convert it to dummy variables.
- It has too few distinct values, so it is of little value. Delete it.
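To show what `value_counts()` reports, here is a small sketch on invented data. The `toy` frame and its `grade` column are hypothetical stand-ins, not the real Pclass values; the point is only how to read the distribution of a feature with few distinct values.

```python
import pandas as pd

# Hypothetical toy column with three distinct integer codes.
toy = pd.DataFrame({"grade": [3, 1, 3, 2, 3, 1]})

# value_counts() returns how many rows carry each distinct value.
counts = toy["grade"].value_counts()
print(counts.sort_index())

# normalize=True reports proportions instead of raw counts.
shares = toy["grade"].value_counts(normalize=True)
print(shares.sort_index())

# nunique() gives the number of distinct values in the column.
n_distinct = toy["grade"].nunique()
print(n_distinct)
```

Looking at the number of distinct values, and whether the gaps between them carry numeric meaning, is usually what informs the choice of processing strategy.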
Use binning to split the Fare feature into four equal bands based on the quartile boundaries. Call the bands Q1, Q2, Q3, and Q4. Use the describe() function to determine the quartile boundaries, then use cut() to create the bins. Finally, group on the new binned category, and find the range of values in each band.
What are the minimum and maximum Fare values in band Q1?
- 0 and 7.8958
- 0 and 7.9104
- 4.0125 and 7.8958
- 0 and 7.9250
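The describe-then-cut workflow described in this exercise can be sketched as follows. This is one possible approach on made-up numbers: the `toy` frame, its `price` column, and the `band` column name are hypothetical, and the values are not Titanic fares.

```python
import pandas as pd

# Hypothetical toy values, used only to show the binning pattern.
toy = pd.DataFrame({"price": [0.0, 5.0, 7.5, 8.0, 12.0, 20.0, 31.0, 100.0]})

# describe() exposes min, the three quartiles, and max as labelled entries.
stats = toy["price"].describe()
edges = [stats["min"], stats["25%"], stats["50%"], stats["75%"], stats["max"]]

# cut() assigns each value to a labelled interval between consecutive edges.
# include_lowest=True makes the first interval closed on the left, so the
# minimum value lands in Q1 rather than becoming NaN.
toy["band"] = pd.cut(
    toy["price"],
    bins=edges,
    labels=["Q1", "Q2", "Q3", "Q4"],
    include_lowest=True,
)

# Group on the new category and report the range of values in each band.
ranges = toy.groupby("band", observed=False)["price"].agg(["min", "max", "count"])
print(ranges)
```

By default `cut()` builds intervals that are open on the left and closed on the right, so whether `include_lowest` is set affects which rows end up in the first band.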