• 20 hours
  • Medium

Last updated on 5/27/20

Test Your Knowledge on Building Generalized Linear Models!

Evaluated skills

  • Build Generalized Linear Models


In this exercise, we're working on the auto-mpg dataset, and instead of modeling the fuel consumption (mpg), we're going to model the origin of the cars based on their engine characteristics.

Since the origin of the cars can take only three different values, this is a classification problem.

import pandas as pd
df = pd.read_csv('auto-mpg.csv')

The origin of the cars is coded as 1, 2, 3, which imposes an artificial order on the categories. So let's restore the actual values:

origins = {1: 'American', 2: 'European', 3: 'Japanese'}
df['origin'] = df['origin'].map(origins)

Next, we dummify the origin variable to create three new binary categorical variables:

  • European
  • Japanese
  • American


# dtype=int keeps the dummy columns as 0/1 integers rather than booleans
df = df.merge(pd.get_dummies(df.origin, dtype=int), left_index=True, right_index=True)

We now have three new binary variables:

1 245
0 147
Name: American, dtype: int64
0 324
1 68
Name: European, dtype: int64
0 313
1 79
Name: Japanese, dtype: int64
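As an illustration of the dummy-encoding step, here is a minimal sketch on a tiny made-up DataFrame (the `toy` data and the `counts` dictionary are assumptions for the example, not part of the auto-mpg dataset). It maps the numeric codes to labels, creates the three binary columns, and counts the 0/1 values in each, which is how counts like the ones above can be obtained.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'origin' column of auto-mpg
toy = pd.DataFrame({"origin": [1, 1, 2, 3, 1, 3]})

# Replace the numeric codes with readable labels
origins = {1: "American", 2: "European", 3: "Japanese"}
toy["origin"] = toy["origin"].map(origins)

# One binary (0/1) column per origin
dummies = pd.get_dummies(toy["origin"], dtype=int)
toy = toy.join(dummies)

# Count the 0s and 1s in each new binary column
counts = {col: toy[col].value_counts().to_dict() for col in dummies.columns}
for col, col_counts in counts.items():
    print(col, col_counts)
```

On the real dataset, the same `value_counts()` call applied to `df['American']`, `df['European']`, and `df['Japanese']` should give the class sizes.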

Now we build classification models to predict whether a car is American or not, using the logit function from statsmodels.

  • Question 1

    Compare the results of the logistic regression for the two models, one with six predictors and one with just three.

    import statsmodels.formula.api as smf
    M_US0 = 'American ~ mpg + cylinders + displacement + horsepower + weight + acceleration'
    res_US0 = smf.logit(M_US0, data=df).fit()
    M_US1 = 'American ~ cylinders + displacement + weight'
    res_US1 = smf.logit(M_US1, data=df).fit()

    Which one of these two assertions is true?

    • The simpler model M_US1 is as good as the more complex model M_US0.

    • The more complex model significantly increases the pseudo R-squared and log-likelihood, and is therefore better.
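One possible way to compare two nested logistic regressions is to look at McFadden's pseudo R-squared and a likelihood-ratio statistic. The sketch below uses invented log-likelihood values (they do not come from the auto-mpg models) and hypothetical helper names; in statsmodels, the corresponding quantities are exposed on the fitted results as `llf`, `llnull`, and `prsquared`.

```python
def mcfadden_r2(llf, llnull):
    """McFadden's pseudo R-squared: 1 - (model log-likelihood / null log-likelihood)."""
    return 1.0 - llf / llnull


def lr_statistic(llf_full, llf_reduced):
    """Likelihood-ratio statistic for a reduced model nested in a full model."""
    return 2.0 * (llf_full - llf_reduced)


# Invented values for illustration only (NOT the auto-mpg results)
llnull = -260.0      # intercept-only model
llf_six = -150.0     # model with six predictors
llf_three = -152.0   # model with three predictors

r2_six = mcfadden_r2(llf_six, llnull)
r2_three = mcfadden_r2(llf_three, llnull)
lr = lr_statistic(llf_six, llf_three)

# The models differ by 3 predictors, so the statistic is compared with the
# chi-square critical value for 3 degrees of freedom at the 5% level.
CHI2_CRIT_3DF_5PCT = 7.815
significant = lr > CHI2_CRIT_3DF_5PCT

print(round(r2_six, 3), round(r2_three, 3), lr, significant)
```

With the real models, the same comparison would use `res_US0.llf`, `res_US1.llf`, and `res_US0.llnull` in place of the invented numbers.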

  • Question 2

    Consider the model:

    M_US1 = 'American ~ cylinders + displacement + weight'
    res_US1 = smf.logit(M_US1, data = df).fit()

    Calculate the classification probabilities with:

    yhat = res_US1.predict(df)

    And plot the histogram with:

    import matplotlib.pyplot as plt
    plt.hist(yhat, bins = 20)

    What can you conclude?

    • The model leans toward classifying cars as American.


    • The model leans toward classifying cars as non-American.

    • The model is hesitant in its classification and can't really tell American and non-American cars apart.
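A histogram can be complemented by a couple of summary numbers. The following is a minimal sketch, assuming a 0.5 decision threshold and an arbitrary "uncertain" band between 0.4 and 0.6; the function name and the toy probabilities are made up for the example.

```python
def summarize_probabilities(probs, low=0.4, high=0.6):
    """Share of predictions at or above 0.5 and share inside the uncertain band."""
    n = len(probs)
    share_above = sum(p >= 0.5 for p in probs) / n
    share_uncertain = sum(low <= p <= high for p in probs) / n
    return {"share_above_0.5": share_above, "share_uncertain": share_uncertain}


# Toy predicted probabilities, not taken from the auto-mpg model
toy_probs = [0.05, 0.10, 0.45, 0.55, 0.90, 0.95, 0.97, 0.99]
summary = summarize_probabilities(toy_probs)
print(summary)
```

Applied to `yhat`, a large uncertain share would suggest a hesitant model, while mass piled up near 0 or 1 would suggest confident predictions.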

  • Question 3

    Consider the confusion matrix for the model M_US1 = 'American ~ cylinders + displacement + weight':

    [[132.  15.]
     [ 30. 215.]]

    Which assertion is true?

    • Thirty non-American cars were identified as American (false positives).

    • Fifteen American cars were identified as non-American (false negatives).


    • The total number of correctly classified cars is 347.

    • All of the above.
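To read a confusion matrix, it helps to see how one is built. The sketch below is a plain-Python illustration on made-up labels and probabilities. It assumes the convention documented for statsmodels' `pred_table` (rows are observed classes, columns are predicted classes) and a 0.5 threshold; whether the matrix in the question was produced that way is an assumption.

```python
def confusion_matrix(y_true, y_prob, threshold=0.5):
    """2x2 table where table[i][j] counts cases observed as i and predicted as j."""
    table = [[0, 0], [0, 0]]
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        table[actual][predicted] += 1
    return table


# Made-up observed labels (1 = American) and predicted probabilities
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_prob = [0.1, 0.7, 0.3, 0.9, 0.4, 0.8, 0.6, 0.2]

table = confusion_matrix(y_true, y_prob)
true_neg, false_pos = table[0]   # observed 0: predicted 0, predicted 1
false_neg, true_pos = table[1]   # observed 1: predicted 0, predicted 1
accuracy = (true_neg + true_pos) / len(y_true)

print(table, accuracy)
```

Under this convention, the diagonal holds the correct classifications, and each row sums to the number of cars actually observed in that class.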