Welcome to the second part of the course!
In the next three chapters, we are going to dive into another Python library: Pandas!
Together with NumPy and Matplotlib, Pandas is one of the basic libraries for data science in Python. Pandas provides powerful and easy-to-use data structures, as well as functions to quickly operate on these structures.
In this chapter, we will look at common operations on the library's most widely used object, the DataFrame. Let's start by using our imagination...
We can characterize a bear by its size. Or rather, by its sizes: say, for example, the size of its legs, the average length of its fur, the size of its tail and the diameter of its belly.
This bear can be represented by a NumPy array:
import numpy as np
a_bear_numpy = np.array([100,5,20,80])
a_bear_numpy
array([100, 5, 20, 80])
Here, our bear has legs 100cm long, fur that is 5cm on average, a 20cm tail and a belly that is 80cm in diameter.
...And if I want to represent several bears, then I create a list of np.array?
Yes! You can do the following:
bear_family = [
    np.array([100, 5, 20, 80]),    # Bear mom
    np.array([50, 2.5, 10, 40]),   # Bear baby
    np.array([110, 6, 22, 80]),    # Bear dad
]
But as we saw in the previous part, we can improve on this, because our list of 3 bears is actually a two-dimensional array, which NumPy manages very well.
bear_family = [
    [100, 5, 20, 80],    # Bear mom
    [50, 2.5, 10, 40],   # Bear baby
    [110, 6, 22, 80],    # Bear dad
]
bear_family_numpy = np.array(bear_family)
bear_family_numpy
array([[100. ,   5. ,  20. ,  80. ],
       [ 50. ,   2.5,  10. ,  40. ],
       [110. ,   6. ,  22. ,  80. ]])
So what's the point of all this?
Well, suppose I want to know the length of the legs (located in position 0 in each list that describes a bear) of the bear located in position 2 in my list (papa bear, because we start at 0). NumPy offers a simple way to do this.
bear_family_numpy[2, 0]
110.0
What if I want to know the leg sizes of my entire bear family?
Easy-peasy! Just delete the 2 (which corresponded to the papa bear), and replace it with the character : meaning that I want ALL bears!
bear_family_numpy[:, 0]
array([ 100., 50., 110.])
Yes! That's it! The length of the mother's legs is 100cm, the baby's is 50cm, and the father's is 110cm.
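The same indexing pattern also works for a whole row, and the selected values can be fed straight into a calculation. Here is a minimal sketch reusing the measurements above (the names dad, legs and mean_leg are only illustrative):

```python
import numpy as np

bear_family_numpy = np.array([
    [100, 5, 20, 80],    # Bear mom
    [50, 2.5, 10, 40],   # Bear baby
    [110, 6, 22, 80],    # Bear dad
])

dad = bear_family_numpy[2, :]    # every measurement of the bear in position 2
legs = bear_family_numpy[:, 0]   # leg size of every bear
mean_leg = legs.mean()           # average leg size across the family

print(dad)
print(mean_leg)
```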
It's certainly quite practical, but writing bear_family_numpy[:, 0] when you want to know the size of the bears' legs is not very explicit. It would be nice to specify somewhere that the 0 corresponds to leg size, no?
This brings us to Pandas (the library, not the animal)!
Get to know Pandas
Everything is a table!
Here is the way I wrote the bear_family variable:
bear_family = [
    [100, 5, 20, 80],
    [50, 2.5, 10, 40],
    [110, 6, 22, 80],
]
It looks a bit like a table with rows and columns, don't you think?
|           | leg | hair | tail | belly |
| bear_mom  | 100 | 5    | 20   | 80    |
| bear_baby | 50  | 2.5  | 10   | 40    |
| bear_dad  | 110 | 6    | 22   | 80    |
So why not use a Python library designed to handle tables like this one? Such a library exists, and it's called Pandas.
Let's explore Pandas in more detail.
import pandas as pd
bear_family_df = pd.DataFrame(bear_family)
bear_family_df
The class we use for representing tables is called DataFrame. To instantiate it (and to give it data), we pass it a list of rank 2, i.e. a list of lists.
We can specify column and row names. And, as the Pandas library is largely based on the NumPy library in its internal operation, we can even pass data in ndarray format to the DataFrame object:
bear_family_df = pd.DataFrame(bear_family_numpy,
index = ['mom', 'baby', 'dad'],
columns = ['leg', 'hair', 'tail', 'belly']
)
bear_family_df
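To check that the labels were taken into account, we can look at a few attributes of the resulting object. The following is a small self-contained sketch (it rebuilds the same data so that it can run on its own); shape, index and columns are standard DataFrame attributes:

```python
import numpy as np
import pandas as pd

bear_family_numpy = np.array([
    [100, 5, 20, 80],
    [50, 2.5, 10, 40],
    [110, 6, 22, 80],
])
bear_family_df = pd.DataFrame(
    bear_family_numpy,
    index=["mom", "baby", "dad"],
    columns=["leg", "hair", "tail", "belly"],
)

print(bear_family_df.shape)          # (number of rows, number of columns)
print(list(bear_family_df.index))    # row labels
print(list(bear_family_df.columns))  # column labels
```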
You may have noticed that the DataFrame object is very similar to concepts found outside the framework of the Python language, such as:
Tables of relational databases (such as MySQL, PostgreSQL, etc.) that are manipulated using SQL language
The data.frame object, which is central to R, a language intended for statisticians.
So, if you already know SQL or R, you will find Pandas DataFrames very easy to use! It makes data manipulation much more user-friendly.
Digging into DataFrames
Here, we are going to take a look at the indexing and slicing functionalities provided by DataFrames.
Imagine we want to access the belly column of our table. There are two possible syntaxes, which return exactly the same result:
bear_family_df.belly
bear_family_df["belly"]
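As a quick check, the sketch below compares the two syntaxes. It also shows one practical difference: the dot syntax only works when the column name is a valid Python identifier. The column name "leg size" is a hypothetical example introduced just for this illustration.

```python
import pandas as pd

bear_family_df = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],
    columns=["leg", "hair", "tail", "belly"],
)

# Both syntaxes return the same column
same = bear_family_df.belly.equals(bear_family_df["belly"])
print(same)

# Hypothetical column whose name contains a space: only brackets work here
with_space = bear_family_df.rename(columns={"leg": "leg size"})
print(with_space["leg size"])
```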
We can also go through all the bears one by one, thanks to the iterrows method. At each iteration of the for loop, it returns a tuple whose first element is the index of the row and whose second element is the content of that row:
for ind_row, content_row in bear_family_df.iterrows():
    print("Here is %s bear:" % ind_row)
    print(content_row)
    print("--------------------")
Let's now access dad bear: first by his position (2), then by his name "dad". The result returned is exactly the same in both cases.
bear_family_df.iloc[2]      # iloc is the positional index
bear_family_df.loc["dad"]   # loc is the label-based index
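Both indexers also accept a second argument for the column, which lets us reach a single cell. Here is a minimal sketch (the variable names are only illustrative):

```python
import pandas as pd

bear_family_df = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],
    columns=["leg", "hair", "tail", "belly"],
)

dad_by_position = bear_family_df.iloc[2]
dad_by_label = bear_family_df.loc["dad"]
print(dad_by_position.equals(dad_by_label))

leg_by_label = bear_family_df.loc["dad", "leg"]   # row label, column label
leg_by_position = bear_family_df.iloc[2, 0]       # row position, column position
print(leg_by_label, leg_by_position)
```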
Let's find out which bear has a belly diameter of 80cm:
bear_family_df["belly"] == 80
The result of this operation is very useful for filtering rows! For example, to select only the bears with a belly of 80cm, it is enough to use the previous result as a mask, as in this case:
mask = bear_family_df["belly"] == 80
bears_80 = bear_family_df[mask]
# Or more commonly :
bears_80 = bear_family_df[bear_family_df["belly"] == 80]
bears_80
What is a mask?
In real life, when you wear a mask, it hides some parts of your face while leaving your eyes, nose and mouth visible. Masks within Pandas act in a similar way: they keep only some DataFrame rows and hide the others. In Pandas, a mask is a sequence of Boolean values (True or False) in which each element is associated with a row of the DataFrame. If the element is True, the row is kept. If the element is False, the row is hidden.
To invert the mask, use the ~ operator. This selects the bears that don't have a belly size of 80cm:
bear_family_df[~mask]
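Masks can also be combined with the & (and) and | (or) operators. Here is a short sketch; the second condition (legs longer than 100cm) is just an example chosen for the illustration:

```python
import pandas as pd

bear_family_df = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],
    columns=["leg", "hair", "tail", "belly"],
)

big_belly = bear_family_df["belly"] == 80
long_legs = bear_family_df["leg"] > 100

both = bear_family_df[big_belly & long_legs]     # belly of 80cm AND legs over 100cm
either = bear_family_df[big_belly | long_legs]   # belly of 80cm OR legs over 100cm
neither = bear_family_df[~(big_belly | long_legs)]

print(both)
print(either)
print(neither)
```

Each condition is computed first and stored in its own variable; when writing everything on one line, each comparison needs its own pair of parentheses.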
Adding new data to a DataFrame
Now, let's learn how to add new data to a DataFrame. There are several ways to do this, but let's look at the simplest one: assembling two DataFrames together.
some_bears = pd.DataFrame([[105, 4, 19, 80], [100, 5, 20, 80]],  # two new bears
                          columns=bear_family_df.columns)        # same columns as bear_family_df
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
all_bears = pd.concat([bear_family_df, some_bears])
all_bears
In the DataFrame all_bears, there are duplicates. Indeed, the first bear (mom) and the last bear (whose index is 1) have exactly the same measurements. If we want to remove the duplicates, we can do this:
all_bears.drop_duplicates()
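Note that drop_duplicates returns a new DataFrame and leaves all_bears untouched. The sketch below puts the steps together using pd.concat, which is the function for assembling DataFrames in current versions of Pandas. The ignore_index=True variant is an optional extra shown here for illustration, in case a clean 0..n-1 numbering is preferred:

```python
import pandas as pd

bear_family_df = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],
    columns=["leg", "hair", "tail", "belly"],
)
some_bears = pd.DataFrame(
    [[105, 4, 19, 80], [100, 5, 20, 80]],
    columns=bear_family_df.columns,
)

all_bears = pd.concat([bear_family_df, some_bears])
unique_bears = all_bears.drop_duplicates()   # keeps the first occurrence of each row

# Optional: discard the original row labels and renumber from 0
renumbered = pd.concat([bear_family_df, some_bears], ignore_index=True)

print(unique_bears)
print(renumbered)
```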
Adding a new column to a DataFrame
# get the names of the columns
bear_family_df.columns
# create a new column containing strings
bear_family_df["sex"] = ["f", "f", "m"]  # mom and baby are female, dad is male
# get the number of rows
len(bear_family_df)
# get the distinct values of a column (use nunique() to count them)
bear_family_df.belly.unique()
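To make the difference between listing and counting distinct values concrete, here is a small sketch. unique, nunique and value_counts are standard Pandas methods; the variable names are only illustrative:

```python
import pandas as pd

bear_family_df = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],
    columns=["leg", "hair", "tail", "belly"],
)
bear_family_df["sex"] = ["f", "f", "m"]

distinct_bellies = bear_family_df["belly"].unique()   # array of the distinct values
n_distinct = bear_family_df["belly"].nunique()        # how many distinct values there are
sex_counts = bear_family_df["sex"].value_counts()     # number of rows per value

print(distinct_bellies)
print(n_distinct)
print(sex_counts)
```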
Load a CSV file with Pandas
A CSV (comma-separated values) file is a file used to represent data in table form. If you use spreadsheet software, it can more than likely export to CSV format!
Pandas specializes in the manipulation of tables. Reading a CSV file with Pandas is therefore child's play: it only takes one line to create a DataFrame from a CSV:
data = pd.read_csv("data.csv", sep=";")
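That's it! The variable data now holds a DataFrame containing the data from the CSV file. The sep=";" argument tells Pandas that the values in this particular file are separated by semicolons rather than commas.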
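To try this without a file on disk, the sketch below reads CSV text from memory with io.StringIO. The content of the CSV (our bear family, with a hypothetical name column) is made up for the illustration; index_col is a standard read_csv parameter that turns a column into the row labels:

```python
import io
import pandas as pd

# Hypothetical CSV content, using ";" as the separator
csv_text = (
    "name;leg;hair;tail;belly\n"
    "mom;100;5;20;80\n"
    "baby;50;2.5;10;40\n"
    "dad;110;6;22;80\n"
)

data = pd.read_csv(io.StringIO(csv_text), sep=";", index_col="name")
print(data)
```

With a real file, the first argument would simply be its path, as in the line above.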