Part 1
Cleanse Your Data Set
Part 2
Represent the Different Types of Variables
Part 3
Perform Univariate Analyses
Part 4
Perform Bivariate Analyses

Part 1
Cleanse Your Data Set
Part 2
Represent the Different Types of Variables
Part 3
Perform Univariate Analyses
Part 4
Perform Bivariate Analyses

Represent Variables in the Form of a Table

There is more to life than histograms!

It is also possible to present variables in the form of a table. It’s not as pretty, of course, but in some cases, this representation is better suited, or supplements, a graphic representation. We will look at four cases, corresponding to the four variable types.

Let's Start With Some Vocabulary

So that the two of us can communicate, you and I, we need to adopt a common language. So we are going to name the different objects we will be manipulating in this chapter.

Here we are working with the bank statement sample, which is made up of transactions. Note that $$$n$$$ will represent the number of banking transactions: this is our sample size.

Next, the variable we will be analyzing will be referred to as $$$X$$$ .

$$$X$$$ is not particularly concrete: it’s just a variable. For example, amount is a variable.

Our data set contains a number of different values for the amount variable: 1.43, 80, 2.20, etc. No more theory now - all practice! We’ve got real values in front of us. The number of values in our sample is $$$n$$$ . So we can express these values as ( $$$x1,...,xn$$$ ).

When referring to variables, it’s best to use a capital letter. However, when referring to observations of the variable, lower case letters are used. Here, ( $$$x1,...,xn$$$ ) is an observed realization of the random variable $$$X$$$ .

Discrete Quantitative and Qualitative Variables

If $$$X$$$ is qualitative (or even discrete quantitative), it can take a number of categories. For example, categ can take the categories “GROCERIES,” “RENT,” “TRANSPORTATION,” etc. We will refer to these categories as { $$$a1,...,ak$$$ }, where $$$k$$$ indicates the number of categories.

Continuous Quantitative Variables

To present continuous quantitative variables, we will group the values of variable X into bins, which will be k in number. These bins will be expressed as follows:

$$$\{[a_1',a_2'[,...,[a_k',a_{k+1}'[\}$$$

Representing Variables in the Form of a Table

Qualitative Variables

For qualitative variables, simply count the number of values for each category. This number is referred to as the occurrence of the category.

So, for a category $$$ai$$$ (where $$$i$$$ is between $$$1$$$ and $$$k$$$ of course!), the occurrence is expressed as $$$n I$$$ . If we add up the occurrences of all of the categories, we get $$$n$$$ : the sample size.

If we divide the number of occurrences by $$$n$$$ , we get the frequency, which is a number between 0 and 1. As I’m sure you’ve guessed, if we add together the frequencies of all of the categories, we get 1!

Here is how a qualitative variable is normally presented formally, using the example of the categ variable:

categ	$$$n$$$	$$$f$$$
OTHER	212	0.688312
GROCERIES	39	0.126623
TRANSPORTATION	21	0.068182
RESTAURANT	16	0.051948
PHONE BILL	7	0.022727
BANK FEE	7	0.022727
RENT	6	0.019481

Quantitative Variables

Discrete Variables

For discrete quantitative variables, we can take the preceding table and add a column to it providing the cumulative frequency. The cumulative frequency of a category ai is simply the sum of the frequencies of all of the categories that are less than or equal to $$$ai$$$ . It is expressed as $$$F$$$ . Here is an example using the quart_month variable:

quart_month	$$$n$$$	$$$f$$$	$$$F$$$
1	86	0.279221	0.279221
2	76	0.246753	0.525974
3	75	0.243506	0.769481
4	71	0.230519	1.000000

Continuous Variables

For continuous variables, simply replace { $$$a1,...,ak$$$ } with bins, as we saw earlier. Here’s what that would look like, using the amount variable:

amount	$$$n$$$	$$$f$$$	$$$F$$$
[...]	[...]	[...]	[...]
[-120.0, -90.0[	2	0.006494	0.048701
[-90.0, -60.0[	11	0.035714	0.084416
[-60.0, -30.0[	28	0.090909	0.175325
[-30.0, 0.0[	237	0.769481	0.944805
[0.0, 30.0[	3	0.009740	0.954545
[...]	[...]	[...]	[...]

Now for the code...

In Python, the code is pretty simple. All you need (almost) is one line of code per column. Here is the code that generated the summary table for the quart_month variable.

occurrences = data["quart_month"].value_counts()
categories = occurrences.index # the occurrences index contains the categories

tab = pd.DataFrame(categories, columns = ["quart_month"]) # creation of table based on categories
tab["n"] = occurrences.values
tab["f"] = tab["n"] / len(data) # len(data) returns the sample size

To calculate the occurrence, we use value_counts() for the variable we want to look at. This method returns a Series object whose values are occurrences, and whose index contains the categories (lines 1 and 2).

Based on our categories, we create the tab table (line 4), to which we add an occurrence column (line 5) and then a frequencies column (line 6).

To calculate cumulative frequencies, you need only 2 lines more. One line sorts the values, the other calculates the sum of the cumulative frequencies:

tab = tab.sort_values("quart_month") # sorts values of variable X (increasing)
tab["F"] = tab["f"].cumsum() # cumsum calculates the cumulative sum

Any feedback to share with us?

Ever considered an OpenClassrooms diploma?

Up to 100% of your training program funded
Flexible start date
Career-focused projects
Individual mentoring

Find the training program and funding option that suits you best

Guide me Compare training types

categ	$$\(n\)$$	$$\(f\)$$
OTHER	212	0.688312
GROCERIES	39	0.126623
TRANSPORTATION	21	0.068182
RESTAURANT	16	0.051948
PHONE BILL	7	0.022727
BANK FEE	7	0.022727
RENT	6	0.019481

quart_month	$$\(n\)$$	$$\(f\)$$	$$\(F\)$$
1	86	0.279221	0.279221
2	76	0.246753	0.525974
3	75	0.243506	0.769481
4	71	0.230519	1.000000

Table of contents

Cleanse Your Data Set

Represent the Different Types of Variables

Perform Univariate Analyses

Perform Bivariate Analyses