- 12 hours
- Medium


Last updated on 5/27/20

# Test Your Knowledge on Building Generalized Linear Models!

### Evaluated skills

- Build Generalized Linear Models

### Description

In this exercise, we're working on the auto-mpg dataset, and instead of modeling the fuel consumption (*mpg*), we're going to model the *origin* of the cars based on their engine characteristics.

Since the origin of the cars can only take three different values, this is a classification problem.

```
import pandas as pd
df = pd.read_csv('auto-mpg.csv')
```

The origin of the cars is coded as 1, 2, 3. This numeric coding implies an order that does not actually exist, so let's bring back the actual values:

```
origins = {1: 'American', 2: 'European', 3: 'Japanese'}
df['origin'] = df['origin'].map(origins)
```

Next, we dummify the origin variable to create three new binary categorical variables:

- European
- Japanese
- American

With:

```
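# dtype=int keeps the dummies as 0/1 (recent pandas versions return booleans by default)
df = df.merge(pd.get_dummies(df['origin'], dtype=int), left_index=True, right_index=True)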
```
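To see what the dummification step produces, here is a minimal sketch on a hypothetical four-row frame (the `toy` frame and its values are made up for illustration and are not part of the auto-mpg data):

```python
import pandas as pd

# Hypothetical four-row frame standing in for the auto-mpg data.
toy = pd.DataFrame({"origin": ["American", "European", "Japanese", "American"]})

# One 0/1 column per category, named after the category value.
dummies = pd.get_dummies(toy["origin"], dtype=int)
toy = toy.join(dummies)

print(toy)
```

Each row has exactly one dummy set to 1, so the three columns always sum to 1 for a given car.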

We now have three new binary variables:

```
print(df['American'].value_counts())
1 245
0 147
Name: American, dtype: int64
print(df['European'].value_counts())
0 324
1 68
Name: European, dtype: int64
print(df['Japanese'].value_counts())
0 313
1 79
Name: Japanese, dtype: int64
```

Now we build classification models to predict whether or not a car is American, using the logit function from statsmodels.
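As a reminder of what a logit model computes, the sketch below passes a linear combination of predictors through the logistic function to obtain a probability. The intercept and coefficient are hypothetical values chosen for illustration, not estimates from the auto-mpg data.

```python
import math

def logistic(z):
    """Map a linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept and weight coefficient, for illustration only.
intercept, coef_weight = -6.0, 0.002

for weight in (2000, 3000, 4000):
    z = intercept + coef_weight * weight
    print(weight, round(logistic(z), 3))
```

A linear predictor of 0 corresponds to a probability of exactly 0.5, which is the usual cut-off between the two classes.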

### Question 1

Compare the results of the logistic regression for the two models, one with six predictors and one with just three.

```
import statsmodels.formula.api as smf

M_US0 = 'American ~ mpg + cylinders + displacement + horsepower + weight + acceleration'
res_US0 = smf.logit(M_US0, data = df).fit()
M_US1 = 'American ~ cylinders + displacement + weight'
res_US1 = smf.logit(M_US1, data = df).fit()
```

Which one of these two assertions is true?

The simpler model M_US1 is as good as the more complex model M_US0.

The more complex model M_US0 significantly increases the pseudo R-squared and the log-likelihood and is therefore better.
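When comparing nested logit models, two quantities from the fitted results are useful: the log-likelihood (`llf` in statsmodels) and McFadden's pseudo R-squared (`prsquared`), which is computed from `llf` and the null-model log-likelihood `llnull`. The sketch below shows the arithmetic on hypothetical log-likelihood values; they are not the values obtained on the auto-mpg data.

```python
def mcfadden_pseudo_r2(llf, llnull):
    """McFadden's pseudo R-squared: 1 - llf / llnull."""
    return 1.0 - llf / llnull

def lr_statistic(llf_full, llf_reduced):
    """Likelihood-ratio statistic for two nested models."""
    return 2.0 * (llf_full - llf_reduced)

# Hypothetical log-likelihoods, for illustration only.
llnull = -260.0     # intercept-only model
llf_small = -130.0  # model with fewer predictors
llf_large = -125.0  # model with three extra predictors

print(mcfadden_pseudo_r2(llf_small, llnull))
print(mcfadden_pseudo_r2(llf_large, llnull))
print(lr_statistic(llf_large, llf_small))
```

Under the null hypothesis that the extra predictors add nothing, the likelihood-ratio statistic approximately follows a chi-square distribution whose degrees of freedom equal the number of extra predictors.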

### Question 2

Consider the model:

```
M_US1 = 'American ~ cylinders + displacement + weight'
res_US1 = smf.logit(M_US1, data = df).fit()
```

Calculate the classification probabilities with:

```
yhat = res_US1.predict(df)
```

And plot the histogram with:

```
import matplotlib.pyplot as plt
plt.hist(yhat, bins = 20)
```

What can you conclude?

The model leans toward classifying cars as American.

The model leans toward classifying cars as non-American.

The model is hesitant in its classification and can't really tell American and non-American cars apart.
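To read such a histogram, it helps to recall how predicted probabilities become class labels. The sketch below applies a 0.5 threshold to a hypothetical list of probabilities (made up for illustration, not taken from `yhat`) and counts how many fall in an arbitrary "uncertain" band around the threshold.

```python
# Hypothetical predicted probabilities, for illustration only.
probs = [0.03, 0.12, 0.48, 0.55, 0.91, 0.97]

threshold = 0.5
labels = [1 if p >= threshold else 0 for p in probs]

# Probabilities close to the threshold indicate a hesitant classifier.
# The 0.4-0.6 band is an arbitrary choice for this sketch.
uncertain = [p for p in probs if 0.4 <= p <= 0.6]

print(labels)
print(len(uncertain), "of", len(probs), "predictions are near the threshold")
```

Probabilities piled up near 0 or 1 indicate confident predictions, while a mass near 0.5 indicates that the model struggles to separate the classes.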

### Question 3

Consider the confusion matrix for the model `M_US1 = 'American ~ cylinders + displacement + weight'`:

```
print(res_US1.pred_table())
[[132.  15.]
 [ 30. 215.]]
```

Which assertion is true?

Thirty non-American cars were identified as American (false positives).

Fifteen American cars were identified as non-American (false negatives).

The total number of correctly classified cars is 347.

All of the above.
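As a reading aid, statsmodels documents `pred_table()[i, j]` as the number of observations whose actual class is `i` and whose predicted class is `j`. The sketch below unpacks a hypothetical 2x2 table under that convention; the counts are made up and are not those of the model above.

```python
# Hypothetical confusion matrix: rows are actual classes, columns are predicted classes.
table = [[50, 10],
         [5, 35]]

true_neg, false_pos = table[0]   # actual 0: predicted 0, predicted 1
false_neg, true_pos = table[1]   # actual 1: predicted 0, predicted 1

total = true_neg + false_pos + false_neg + true_pos
correct = true_neg + true_pos
accuracy = correct / total

print("correctly classified:", correct)
print("accuracy:", accuracy)
```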