Last updated on 6/23/22

Test Your Knowledge on Building Linear Regression Models!

Evaluated skills

  • Build Linear Regression Models

Description

In this quiz, we're going to build several models based on the bike-sharing dataset.

The bike-sharing dataset is available from the UCI repository.

The dataset has over 17k samples and 16 different variables. We are going to focus on the following five attributes:

  • Season: season (1:spring, 2:summer, 3:fall, 4:winter).
  • Temp: Normalized temperature in Celsius. 
  • Hum: Normalized humidity. 
  • Wind speed: Normalized wind speed. 
  • Cnt: count of total rental bikes. 

You can load the dataset with:

import pandas as pd
df = pd.read_csv('bike_sharing_day.csv')

Remove the non-essential columns with:

df = df[['season', 'temp','hum','windspeed','cnt']]

To do this quiz, you should first import the following packages:

import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
import numpy as np
import pandas as pd

  • Question 1

    Let's find out which predictor is driving the usage of bikes (the cnt outcome variable). First, build the three univariate regression models:

    1. cnt ~ temp
    2. cnt ~ hum
    3. cnt ~ windspeed

    Looking at the R-squared metric, which variable explains the most variability in the outcome cnt?
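    A minimal sketch of how the three fits could be set up with the formula API imported above. The helper name univariate_r2 and the small synthetic demo DataFrame are assumptions made so that the snippet runs on its own; they are not part of the course material. Passing the real df instead of demo gives the values needed for this question.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def univariate_r2(data, outcome="cnt", predictors=("temp", "hum", "windspeed")):
    """Fit one simple OLS model per predictor and return its R-squared."""
    scores = {}
    for predictor in predictors:
        formula = "{} ~ {}".format(outcome, predictor)
        res = smf.ols(formula, data=data).fit()
        scores[predictor] = res.rsquared
    return scores


# Synthetic stand-in data (hypothetical values, NOT the bike-sharing dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "temp": rng.uniform(0, 1, 200),
    "hum": rng.uniform(0, 1, 200),
    "windspeed": rng.uniform(0, 1, 200),
})
# In this toy data, cnt depends on temp only, plus noise.
demo["cnt"] = 1000 + 500 * demo["temp"] + rng.normal(0, 100, 200)

demo_scores = univariate_r2(demo)
for name, r2 in demo_scores.items():
    print("R^2 for cnt ~ {}: {:.2f}".format(name, r2))
```

    With the real data, univariate_r2(df) returns one R-squared per predictor, which can be compared directly.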

    • Temp

    • Hum

    • Wind speed

  • Question 2

    Look at the influence of each predictor for the different seasons.
    The seasons are defined as:

    seasons = {1:'spring', 2:'summer', 3:'fall', 4:'winter'}
    

    You can limit the regression to a specific season, for instance, spring, with the following line:

    formula = "cnt ~ temp"  # or any of the other univariate formulas
    res = smf.ols(formula, data = df[df.season == 1]).fit()
    

    Looking at the R-squared for each season and univariate model:

    • cnt ~ temp
    • cnt ~ hum
    • cnt ~ windspeed

    Which of the following assertions is true?

    Don't hesitate to loop over the seasons and the predictors with the following code:

    seasons = {1:'spring', 2:'summer', 3:'fall', 4:'winter'}
    
    for season in range(1,5):
        print("--"* 20)
        print("season {}".format(seasons[season]))
        for variable in ['temp', 'hum', 'windspeed']:
            formula = "cnt ~ {}".format(variable)
            res = smf.ols(formula, data = df[df.season == season]).fit()
            print("- R^2 for {}: {:.2f}".format(variable, res.rsquared))
    
    • Temp has the most influence in spring.

    • Humidity is the most important factor in the fall.

    • Wind speed always has very little influence on the usage.

    • All of the above.

  • Question 3

    When the models are fitted on the data for all four seasons, all p-values are well below 0.05, so all three predictors are statistically significant.
    However, when selecting a specific season, some predictors are no longer significant.

    Consider the p-values of the predictors in each univariate model for each season:

    • cnt ~ temp
    • cnt ~ hum
    • cnt ~ windspeed

    Which assertion is true?
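    One possible way to collect the p-values is sketched below, reusing the seasons dictionary from Question 2. The helper name seasonal_pvalues and the synthetic demo DataFrame are assumptions that keep the snippet self-contained; the 0.05 threshold is the one mentioned in this question. Run the helper on the real df to answer it.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

seasons = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}


def seasonal_pvalues(data, predictors=("temp", "hum", "windspeed")):
    """Return {(season name, predictor): p-value of the slope in cnt ~ predictor}."""
    pvalues = {}
    for season, name in seasons.items():
        subset = data[data.season == season]
        for predictor in predictors:
            res = smf.ols("cnt ~ {}".format(predictor), data=subset).fit()
            # res.pvalues is indexed by term name: "Intercept" and the predictor.
            pvalues[(name, predictor)] = res.pvalues[predictor]
    return pvalues


# Synthetic stand-in data (hypothetical values, NOT the bike-sharing dataset).
rng = np.random.default_rng(1)
n = 400
demo = pd.DataFrame({
    "season": rng.integers(1, 5, n),  # integers 1 to 4
    "temp": rng.uniform(0, 1, n),
    "hum": rng.uniform(0, 1, n),
    "windspeed": rng.uniform(0, 1, n),
})
# In this toy data, cnt depends on temp only, plus noise.
demo["cnt"] = 1000 + 500 * demo["temp"] + rng.normal(0, 100, n)

demo_pvalues = seasonal_pvalues(demo)
for (name, predictor), p in demo_pvalues.items():
    label = "significant" if p < 0.05 else "not significant"
    print("{:<6} cnt ~ {:<9} p = {:.4f} ({})".format(name, predictor, p, label))
```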

    • Humidity is never significant.

    • Temperature is not significant in the fall.

    • Wind speed is always a significant factor.

    • None of the predictors are significant in winter.
