Last updated on 04/10/2021

## Transfer your data from NumPy to Pandas

Welcome to the second part of the course!

In the next three chapters, we are going to dive into another Python library: Pandas!

Together with `NumPy` and `Matplotlib`, `Pandas` is one of the core libraries for data science in Python. `Pandas` provides powerful, easy-to-use data structures, as well as functions to operate on them quickly.

In this chapter, we will look at the most common operations on the library's most widely used object, the DataFrame. Let's start by using our imagination...

We can characterize a bear by its size. Or rather, by its measurements: say, for example, the length of its legs, the average length of its fur, the length of its tail and the diameter of its belly.

This bear can be represented by a NumPy array:

```python
import numpy as np

a_bear_numpy = np.array([100, 5, 20, 80])
a_bear_numpy
```
`array([100,   5,  20,  80])`

Here, our bear has legs 100cm long, fur that is 5cm on average, a 20cm tail and a belly that is 80cm in diameter.

And if I want to represent several bears, do I create a list of `np.array` objects?

Yes! You can do the following:

```python
bear_family = [
    np.array([100, 5, 20, 80]),    # Mom bear
    np.array([50, 2.5, 10, 40]),   # Baby bear
    np.array([110, 6, 22, 80]),    # Papa bear
]
```

But as we saw in the previous part, we can do better: our list of 3 bears is really a two-dimensional array, which NumPy handles very well.

```python
bear_family = [
    [100, 5, 20, 80],    # Mom bear
    [50, 2.5, 10, 40],   # Baby bear
    [110, 6, 22, 80],    # Papa bear
]

bear_family_numpy = np.array(bear_family)
bear_family_numpy
```
```
array([[100. ,   5. ,  20. ,  80. ],
       [ 50. ,   2.5,  10. ,  40. ],
       [110. ,   6. ,  22. ,  80. ]])
```

So what's the point of all this?

Well, suppose I want to know the length of the legs (located in position 0 in each list that describes a bear) of the bear located in position 2 in my list (papa bear, because we start at 0). NumPy offers a simple way to do this.

```python
bear_family_numpy[2, 0]
```
`110.0`

What if I want to know the leg sizes of my entire bear family?

Easy-peasy! Just remove the 2 (which corresponded to the papa bear) and replace it with the `:` character, which means that I want ALL bears!

```python
bear_family_numpy[:, 0]
```
`array([100.,  50., 110.])`

Yes! That's it! The length of the mother's legs is 100cm, the baby's is 50cm, and the father's is 110cm.
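
The same indexing idea extends to whole rows and to several columns at once. Here is a minimal sketch, assuming the same three bears as above; the variable names `papa` and `legs_and_tails` are only illustrative.

```python
import numpy as np

# Same measurements as above: leg, fur, tail, belly (in cm)
bear_family_numpy = np.array([
    [100, 5, 20, 80],    # mom bear
    [50, 2.5, 10, 40],   # baby bear
    [110, 6, 22, 80],    # papa bear
])

# The shape gives (number of bears, number of measurements)
shape = bear_family_numpy.shape

# All measurements of the bear in position 2 (papa bear)
papa = bear_family_numpy[2, :]

# Columns 0 and 2 (legs and tails) for every bear
legs_and_tails = bear_family_numpy[:, [0, 2]]

print(shape)
print(papa)
print(legs_and_tails)
```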

It's certainly quite practical, but writing `bear_family_numpy[:, 0]` when you want to know the size of the bears' legs is not very explicit. It would be nice to specify somewhere that the 0 corresponds to leg size, no?

This brings us to Pandas (the library, not the animal)!

#### Everything is a table!

Here is the way I wrote the  `bear_family`  variable:

```python
bear_family = [
    [100, 5, 20, 80],
    [50, 2.5, 10, 40],
    [110, 6, 22, 80],
]
```

It looks a bit like a table with rows and columns, don't you think?

|  | leg | hair | tail | belly |
|---|---|---|---|---|
| bear_mom | 100 | 5 | 20 | 80 |
| bear_baby | 50 | 2.5 | 10 | 40 |
| bear_dad | 110 | 6 | 22 | 80 |

So why not use a Python library designed to handle tables like this one? Such a library exists, and it's called Pandas.

Let's explore Pandas in more detail.

```python
import pandas as pd

bear_family_df = pd.DataFrame(bear_family)
bear_family_df
```

The class we use for representing tables is called `DataFrame`. To create one (and to give it data), we pass it a nested list, i.e. a list of lists.

We can specify column and row names. And, as the Pandas library is largely built on top of NumPy internally, we can even pass data in `ndarray` format to the DataFrame constructor:

```python
bear_family_df = pd.DataFrame(
    bear_family_numpy,
    index=['mom', 'baby', 'dad'],
    columns=['leg', 'hair', 'tail', 'belly'],
)

bear_family_df
```
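
To make the benefit of labels concrete, here is a small sketch that compares positional access in NumPy with label-based access in Pandas. It rebuilds the bear data so that it runs on its own; the row labels `mom`, `baby` and `dad` are assumed for illustration.

```python
import numpy as np
import pandas as pd

# Same measurements as above: leg, hair, tail, belly (in cm)
bear_family_numpy = np.array([
    [100, 5, 20, 80],    # mom
    [50, 2.5, 10, 40],   # baby
    [110, 6, 22, 80],    # dad
])

bear_family_df = pd.DataFrame(
    bear_family_numpy,
    index=["mom", "baby", "dad"],   # assumed row labels
    columns=["leg", "hair", "tail", "belly"],
)

# NumPy: we have to remember that column 0 holds the leg length
legs_numpy = bear_family_numpy[:, 0]

# Pandas: the column name says what the values mean,
# and each value keeps the label of its row
legs_pandas = bear_family_df["leg"]

print(legs_numpy)
print(legs_pandas)
```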

You may have noticed that the DataFrame object is very similar to concepts found outside the framework of the Python language, such as:

• Tables of relational databases (such as MySQL, PostgreSQL, etc.) that are manipulated using the SQL language

• The `data.frame` object that is central to R, a language designed for statisticians.

So, if you already know SQL or R, you will find Pandas DataFrames very easy to use! It makes data manipulation much more user friendly.

#### Digging into DataFrames

Here, we are going to take a look at the indexing and slicing functionalities provided by DataFrames.

Imagine we want to access the `belly`  column of our table. There are two possible syntaxes, which return exactly the same result:

```python
bear_family_df.belly      # attribute syntax
bear_family_df["belly"]   # bracket syntax
```
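
As a quick check, the sketch below confirms that both syntaxes give the same Series, and shows one case where only the bracket syntax works. The `ear size` column and its values are hypothetical, added only to illustrate a column name containing a space.

```python
import pandas as pd

bears = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],   # assumed row labels
    columns=["leg", "hair", "tail", "belly"],
)

# Both syntaxes return the same Series
same = bears.belly.equals(bears["belly"])
print(same)

# Hypothetical column whose name contains a space:
# it cannot be written as an attribute, so brackets are required
bears["ear size"] = [12, 6, 13]
ear_sizes = bears["ear size"]
print(ear_sizes)

# Passing a list of names returns a DataFrame with just those columns
legs_and_tails = bears[["leg", "tail"]]
print(legs_and_tails)
```

The bracket syntax is the more general of the two, which is why many people use it by default.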

We can also go through all the bears one by one, thanks to the `iterrows` method. At each iteration of the `for` loop, it returns a tuple whose first element is the index of the row, and whose second element is the content of that row:

```python
for ind_row, content_row in bear_family_df.iterrows():
    print("Here is %s bear:" % ind_row)
    print(content_row)
    print("--------------------")
```
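
Each row yielded by `iterrows` is itself a Series indexed by the column names, so individual measurements can be read by name. A minimal sketch, assuming the same bears and row labels as before (the dictionary `leg_by_bear` is only there for illustration):

```python
import pandas as pd

bears = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],   # assumed row labels
    columns=["leg", "hair", "tail", "belly"],
)

leg_by_bear = {}
for name, row in bears.iterrows():
    # row is a Series: its index is the list of column names
    leg_by_bear[name] = row["leg"]
    print(f"{name}: legs of {row['leg']} cm")
```

Looping row by row is handy for inspection, but for computations on whole columns it is usually simpler to work on the columns directly.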

Let's now access dad bear: first by his position (2), then by his name "dad". The result returned is exactly the same in both cases.

```python
bear_family_df.iloc[2]      # iloc is the positional index
bear_family_df.loc["dad"]   # loc is the label-based index
```
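
Both indexers also accept a column selector and slices. The following sketch illustrates this under the same assumptions as before (row labels `mom`, `baby`, `dad`):

```python
import pandas as pd

bears = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],   # assumed row labels
    columns=["leg", "hair", "tail", "belly"],
)

# Access by position and by label return the same row
row_by_position = bears.iloc[2]
row_by_label = bears.loc["dad"]
same_row = row_by_position.equals(row_by_label)

# A single cell: row label first, then column label
dad_belly = bears.loc["dad", "belly"]

# A positional slice: the first two rows
first_two = bears.iloc[:2]

print(same_row)
print(dad_belly)
print(first_two)
```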

Let's find out which bear has a belly diameter of 80cm:

```python
bear_family_df["belly"] == 80
```

The result of this operation is very useful for filtering rows! For example, to select only the bears with a belly of 80cm, simply use the previous result as a mask:

```python
mask = bear_family_df["belly"] == 80
bears_80 = bear_family_df[mask]

# Or more commonly:
bears_80 = bear_family_df[bear_family_df["belly"] == 80]

bears_80
```

In real life, when you wear a mask, it hides some parts of your face while leaving your eyes, nose and mouth visible. Masks in Pandas act in a similar way: they keep some rows of the DataFrame and hide the others. In Pandas, a mask is a sequence of Boolean values (`True` or `False`), typically a Series, in which each element is associated with one row of the DataFrame. If the element is `True`, the row is kept. If it is `False`, the row is filtered out.

To invert the mask, use the `~` operator. This selects the bears that don't have a belly size of 80cm:

```python
bear_family_df[~mask]
```

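
Masks can also be combined. Here is a short sketch, assuming the same bears as before, that uses `&` (and) and `|` (or) on two masks; the threshold of 100cm for leg length is an arbitrary choice for the example.

```python
import pandas as pd

bears = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],   # assumed row labels
    columns=["leg", "hair", "tail", "belly"],
)

big_belly = bears["belly"] == 80   # one Boolean per row
long_legs = bears["leg"] > 100     # arbitrary threshold for the example

# Rows where both conditions hold
both = bears[big_belly & long_legs]

# Rows where at least one condition holds
either = bears[big_belly | long_legs]

# True counts as 1, so summing a mask counts the matching rows
n_big_belly = big_belly.sum()

print(both)
print(either)
print(n_big_belly)
```

When the conditions are written inline, each one needs its own parentheses, for example `bears[(bears["belly"] == 80) & (bears["leg"] > 100)]`.
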
##### Adding new data to a DataFrame

Now, let's learn how to add new data to a `DataFrame`. There are several ways to do this, but let's look at the simplest one: assembling two DataFrames together.

```python
# Two new bears, with the same columns as bear_family_df
some_bears = pd.DataFrame([[105, 4, 19, 80], [100, 5, 20, 80]],
                          columns=bear_family_df.columns)
# pd.concat stacks both DataFrames (DataFrame.append was removed in pandas 2.0)
all_bears = pd.concat([bear_family_df, some_bears])
all_bears
```

In the DataFrame `all_bears`, there are duplicates. Indeed, the first bear (mom) and the last bear (whose index is 1) have exactly the same measurements. If we want to remove the duplicates, we can do this:

```python
all_bears.drop_duplicates()
```

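
The following self-contained sketch puts the two steps together. It assumes the same bears and row labels as before, and also shows `duplicated`, which flags the repeated rows, and the `ignore_index` option of `pd.concat`, which renumbers the rows from 0.

```python
import pandas as pd

columns = ["leg", "hair", "tail", "belly"]
family = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],   # assumed row labels
    columns=columns,
)
some_bears = pd.DataFrame([[105, 4, 19, 80], [100, 5, 20, 80]],
                          columns=columns)

# Stack the two tables; the original row labels are kept
all_bears = pd.concat([family, some_bears])

# True for each row that repeats an earlier row
dup_flags = all_bears.duplicated()

# drop_duplicates returns a new DataFrame; all_bears is unchanged
unique_bears = all_bears.drop_duplicates()

# Optional: discard the old labels and renumber the rows from 0
renumbered = pd.concat([family, some_bears], ignore_index=True)

print(dup_flags)
print(unique_bears)
print(renumbered)
```
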
##### Adding a new column to a DataFrame

```python
# get the names of the columns
bear_family_df.columns

# create a new column, containing strings
# (mom and baby are female, dad is male)
bear_family_df["sex"] = ["f", "f", "m"]

# get the number of rows
len(bear_family_df)

# get the distinct values of a column
bear_family_df.belly.unique()
```
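
Note that `unique` returns the distinct values themselves, not how many there are. The sketch below, which assumes the same bears as before, contrasts it with `nunique` (the number of distinct values) and `value_counts` (how often each value appears).

```python
import pandas as pd

bears = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],   # assumed row labels
    columns=["leg", "hair", "tail", "belly"],
)
bears["sex"] = ["f", "f", "m"]

# The distinct values of a column, as an array
distinct_bellies = bears["belly"].unique()

# The number of distinct values
n_distinct = bears["belly"].nunique()

# How many rows carry each value
sex_counts = bears["sex"].value_counts()

print(distinct_bellies)
print(n_distinct)
print(sex_counts)
```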

#### Load a CSV file with Pandas

A CSV (comma-separated values) file is a text file used to represent data in table form. If you use spreadsheet software, it can more than likely export to CSV format!

Pandas specializes in the manipulation of tables. Reading a CSV file with Pandas is therefore child's play: it only takes one line to create a DataFrame from a CSV:

```python
data = pd.read_csv("data.csv", sep=";")
```

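That's it! The variable `data` now holds a DataFrame containing the CSV file data. Here, `sep=";"` tells Pandas that the values in this file are separated by semicolons rather than commas.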
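
To try this without a file on disk, here is a minimal sketch that reads CSV text from memory. The file content and the `name` column are hypothetical, modeled on the bear table used in this chapter; `io.StringIO` simply stands in for the file.

```python
import io

import pandas as pd

# Hypothetical content; in practice this would live in data.csv
csv_text = """name;leg;hair;tail;belly
mom;100;5;20;80
baby;50;2.5;10;40
dad;110;6;22;80
"""

# sep=";" matches the separator used in the text above;
# index_col="name" uses the first column as row labels
data = pd.read_csv(io.StringIO(csv_text), sep=";", index_col="name")

print(data.head())   # first rows of the table
print(data.dtypes)   # type inferred for each column
```

Printing `head()` and `dtypes` right after loading is a quick way to check that the separator and the column types were understood as intended.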