Last updated on 6/23/22
Test Your Knowledge on Linearity, Correlation, and Hypothesis Testing
- Understand the Fundamentals of Statistical Modeling
In this exercise, you will analyze a new dataset in terms of linearity and correlation, test whether the means of certain categories differ significantly, and check whether some variables follow a normal distribution.
The dataset is the Bike Sharing dataset, available from the UCI repository. It contains both daily and hourly data. We will work with the day.csv file, which has 731 samples and 16 variables.
We will focus on the following five attributes:
- Season: Season (1:spring, 2:summer, 3:fall, 4:winter)
- Temp: Normalized temperature in Celsius.
- Hum: Normalized humidity.
- Wind speed: Normalized wind speed.
- CNT: Count of total rental bikes.
You can load the dataset with:
import pandas as pd
df = pd.read_csv('day.csv')
Then remove the nonessential columns with:
df = df[['season', 'temp', 'hum', 'windspeed', 'cnt']]
Draw the scatter plots of the variables (use sns.pairplot(df) from the Seaborn library).
Looking at the scatter plots of the variables, which relation is the most linear looking?
hum vs. temp
cnt vs. temp
wind speed vs. hum
All of the above.
Calculate the correlation of the different variables using the Pearson method.
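One way to do this is with the pandas `DataFrame.corr` method. The sketch below is only an illustration: it builds a synthetic DataFrame named `demo` (an assumption, not the real day.csv data) with the same five column names so that it runs on its own. In the exercise, call the same method on your `df` instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for day.csv: made-up values, same column names only.
rng = np.random.default_rng(0)
n = 200
temp = rng.uniform(0, 1, n)
demo = pd.DataFrame({
    "season": rng.integers(1, 5, n),   # integers 1..4
    "temp": temp,
    "hum": rng.uniform(0, 1, n),
    "windspeed": rng.uniform(0, 1, n),
    # cnt is built to depend on temp so the demo shows a clear correlation
    "cnt": (1000 + 5000 * temp + rng.normal(0, 500, n)).round(),
})

# Pearson is the default method; naming it makes the intent explicit.
corr = demo.corr(method="pearson")
print(corr.round(2))

# Correlation of each variable with cnt, sorted from most negative to most positive.
cnt_corr = corr["cnt"].drop("cnt").sort_values()
print(cnt_corr.round(2))
```

Sorting the `cnt` column of the matrix puts any negatively correlated variables at the top, which makes the sign of each relationship easy to read.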
Which pair of variables is negatively correlated with the number of users (cnt)?
season & temp
temp & hum
wind speed & hum
temp & wind speed
Consider the correlation of the wind speed versus the other variables.
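To look at one variable against all the others, `DataFrame.corrwith` can help. The following is a minimal sketch under stated assumptions: `correlations_with` and `strength` are hypothetical helpers, the `toy` values are invented for illustration (not taken from day.csv), and the 0.3 and 0.7 cutoffs are a common rule of thumb rather than a fixed standard.

```python
import pandas as pd

def correlations_with(df, column):
    """Pearson correlation of `column` with every other column of df."""
    return df.corrwith(df[column]).drop(column)

def strength(r):
    """Rough verbal label for |r|; the cutoffs are a rule-of-thumb assumption."""
    a = abs(r)
    if a < 0.3:
        return "weak"
    if a < 0.7:
        return "moderate"
    return "strong"

# Invented toy values, only to make the example runnable.
toy = pd.DataFrame({
    "windspeed": [0.10, 0.20, 0.30, 0.40, 0.50],
    "temp": [0.50, 0.45, 0.40, 0.30, 0.25],
    "cnt": [4000, 4200, 3900, 4100, 3800],
})

for name, r in correlations_with(toy, "windspeed").items():
    direction = "negative" if r < 0 else "positive"
    print(f"windspeed vs. {name}: r = {r:.2f} ({strength(r)}, {direction})")
```

With the real data, `correlations_with(df, 'windspeed')` gives the sign and size of each relationship. Keep in mind that a correlation describes an association, not a cause.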
What can you conclude when there's more wind? Careful, there are several correct answers.
A slight decrease in the number of people biking.
A very large decrease in the number of people biking.
Colder temperatures and less humidity.
A slight increase in the number of users.