Last updated on 6/23/22

# Test Your Knowledge on Linearity, Correlation, and Hypothesis Testing

### Evaluated skills

• Understand the Fundamentals of Statistical Modeling

### Description

In this exercise, you will analyze a new dataset for linearity and correlation, test whether the means of certain categories differ significantly, and check whether some variables follow a normal distribution.

The dataset is the Bike Sharing dataset, available from the UCI repository. It contains both daily and hourly data. We will work with the day.csv file, which has 731 samples and 16 variables.

We will focus on the following five attributes:

• `season`: Season (1: spring, 2: summer, 3: fall, 4: winter)
• `temp`: Normalized temperature in Celsius
• `hum`: Normalized humidity
• `windspeed`: Normalized wind speed
• `cnt`: Count of total rental bikes

You can load the dataset with:

```python
import pandas as pd
df = pd.read_csv('day.csv')  # assumes day.csv is in the current working directory
```

Then keep only the five columns of interest with:

```python
df = df[['season', 'temp', 'hum', 'windspeed', 'cnt']]
```

### Question 1

Draw the scatter plots of the variables (use `sns.pairplot(df)` from the Seaborn library).

Based on the scatter plots, which relationship looks the most linear?

• hum vs. temp

• cnt vs. temp

• wind speed vs. hum

• All of the above.

### Question 2

Calculate the pairwise correlations between the variables using the Pearson method.
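One way to approach this, sketched under the assumption that `df` contains only numeric columns: `DataFrame.corr(method="pearson")` returns the full correlation matrix, and a single column of that matrix gives each variable's correlation with a chosen target. The helper name `correlations_with` is hypothetical.

```python
import pandas as pd


def correlations_with(df: pd.DataFrame, target: str = "cnt") -> pd.Series:
    """Pearson correlation of every other column with `target`, sorted ascending.

    correlations_with is a hypothetical helper, not part of pandas.
    """
    # Symmetric matrix with 1.0 on the diagonal.
    corr_matrix = df.corr(method="pearson")
    # Keep the target's column and drop its trivial self-correlation.
    return corr_matrix[target].drop(target).sort_values()
```

A negative value means the variable tends to decrease as the target increases. Passing `target="windspeed"` instead gives the correlations of wind speed with the other variables.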

Which two variables are both negatively correlated with the number of users (`cnt`)?

• season & temp

• temp & hum

• wind speed & hum

• temp & wind speed

### Question 3

Consider the correlation of wind speed with the other variables.

What can you conclude happens when there is more wind?

Careful, there are several correct answers.

• A slight decrease in the number of people biking.

• A very large decrease in the number of people biking.

• Colder temperatures and less humidity.

• A slight increase in the number of users.