Use Python Libraries for Data Science

8 heures
Moyenne

Ce cours est visible gratuitement en ligne.

course.header.alt.is_certifying

J'ai tout compris !

Mis à jour le 22/01/2024

Manipulate a DataFrame

Chapter banner

We’ve imported our first DataFrame, but it certainly won’t be our last! Once we’ve successfully imported our data, the next logical step would be to start manipulating it to suit our purposes.

Here’s a real-life example. We want to access a list of email addresses for all customers who have taken out a loan with us. We’ll use the customers DataFrame we imported in the last chapter to create an email marketing list for our special offers. How can we do this using Pandas? Well, that’s what we’re about to find out!

Navigate the DataFrame

Each column in a DataFrame has a distinct name, which makes it easier to understand and lets you access a particular column using its name.

To access a column in a DataFrame, you just need to use the dataframe_name[column_name] syntax .

So, here’s how you would access a list of email addresses from the customers DataFrame we imported in the previous chapter:

customers['email']

Here we’re accessing the email variable in our customers DataFrame. The syntax makes it clear to see what we’re trying to access.

We might even decide to store the email variable data in a different variable within our code to help us with our mailing list mentioned above. Here, I’ve created an email variable, which I’ll use to store all the email addresses from my customer database:

email = customers['email']

But what if we want to access more than one column?

Well, all you have to do is store the set of columns you want to access in a list. Here’s an example of how we can select the name and email address for our customers:

variables = ['name', 'email']
customers[variables]

I know there are only two lines of code here, but let’s just take a closer look:

We want to access the name and email address for all of our customers. So, the first thing we do is create a list called variables , which we’ll use to store the set of variable names.
We use the same syntax as before to select the list of variables defined within our DataFrame.

You’ll see that our set of names and email addresses are displayed correctly on the screen.

It is, of course, possible to perform these two steps in a single command: customers[['name', 'email]] .

You’ll notice that the variables display in the order we specified (in this case, name and then email) rather than the order in which they were originally provided (email is before name in our customers DataFrame). If you check the initial DataFrame, you’ll see that the original sequence has been retained. So, the selection process allows us to resequence the columns however we like, without changing the original DataFrame.

Discover the Pandas Series Object

Before we go any further, I’d like to ask you a simple question. In your opinion, what objects have we been working with so far in this chapter? You’re starting to understand the DataFrame object, but what about a column within a DataFrame? That’s not a DataFrame. Well, it’s time to unveil the second Pandas object that you’re going to use a great deal and it’s one that’s closely linked to a DataFrame. The Series.

Each column in your DataFrame is in fact a Series object type. You can check this yourself using the type function:

We need to ensure that we understand the difference between these objects. Even though they have many methods in common, there are some which are only used by one or the other and this can lead to errors when carrying out data analysis in Python.

Let’s take a look at the following display:

Example of a customer DataFrame consisting of 150 rows by 2 columns with 2 headers, which are name and email. — DataFrame example

We have quite a pleasing display, structured as a table, where each column has a distinctly defined name. This is a DataFrame. The information shown at the bottom includes the number of rows and columns. It’s actually pretty logical, because we’ve selected two columns here. Given that there are at least two columns, it can’t be a Series.

Let’s now take a look at this:

Series example consisting of 150 rows by 2 columns — Series example

You can see that the display is a bit more basic and that the information displayed below is not the same. This is a Series.

The information provided includes the name of the Series (which matches the name displayed at the top of the column when it’s part of a DataFrame), the length (or number of rows) and the column’s data type (integer, string, etc.).

A Series can only contain a single data type, whereas a DataFrame (essentially, a set of Series) can contain columns of different types, such as a column of integers, a column of floating decimal numbers, etc.

We need to keep these ideas in mind when carrying out data analysis. There are numerous methods that are common to these two objects, whereas others apply to specific objects only. For example, all of the attributes we saw in the previous chapter (i.e., shape , head and dtypes ) can apply to Series. But some of the methods we’ll see later on in this course are exclusive to DataFrames due to their multidimensional nature.

Let’s now take a look at some of the ways we can work with a Series.

Work With Columns

In this section, we’re going to have a look at some of the key operations we use with DataFrames:

Creating or deleting a column
Renaming a column
Changing the column data type
Sorting a DataFrame based on one or more columns

Change an Existing Column

Before we see how to create a column, let’s first look at how to change one that already exists.

Using the syntax we saw before, we have dataframe_name[column_name] . This code enables us to access a column in a DataFrame without changing it. But what you don’t yet know is that it also enables you to change an existing DataFrame.

Let’s consider variable a that exists in our code:

We can access it using its name and apply various different functions to it (e.g., print(a) to display it) without changing its value.
But we can also define or change its value (e.g., a=4 , which will store the integer 4 in our variable a ).

We can do exactly the same thing with a column in a DataFrame:

We can access it and apply various functions or methods to it, which is what we’ve done up to now.
We can also change the values inside the column.

For example, if I run the following:

customers['name'] = 1

This will do two things. Firstly, it will update the existing name variable, replacing all values with 1, and it will also transform the variable’s data type, as you can see if you look at the contents:

Transforming an “object” type column to “integer” type : names become 1s — Transforming an “object” type column to “integer” type

name was originally an object type column, but now it’s an integer type, because 1 is an integer.

So, if we replace a column using a fixed value (as we did here), this will replace all values in our column with this fixed value and will change the type of the column to match the type of the specified fixed value, whether or not it’s numeric.

We can also replace a column with an object that has the same dimensions. What we mean by this is a list, an array or a series containing exactly the same number of elements. For example, if I wanted to change the identifier column by multiplying all values by 100, I could do it as follows:

customers['identifier'] = customers['identifier']*100

I might also decide to replace it with random values between 1 and 1000:

customers['identifier'] = np.random.randint(1, 1000, customers.shape[0])

Okay, I get it! But now I want to go back to the original values, because my name column is no longer particularly meaningful!

Unfortunately, Pandas doesn’t keep track of changes made to DataFrames, so it’s not possible to get our previous values back. But don’t panic! If you’ve run these lines of code yourself, you just need to reload the data using the command pd.read_csv to restore your data to the original values.

Create and Delete a Column

Now, how do we create a column? Well, in the same way that we change one! We refer to the column as if it already exists and assign it a value. Not convinced? Let me show you an example:

customers['id'] = customers['identifier'] + 1000
customers.head()

If you run these lines of code yourself, you’ll see that even though the id column doesn’t exist in your DataFrame, Pandas will create it and assign the defined value to it (in this case, the value from identifier, plus 1000).

To sum up, whether you’re updating or creating a column ( col ), the syntax will be: my_dataframe['col'] = x , where x is either a fixed value or an object with the same dimensions as the column that we want to update or create.

What about deleting an existing column?

There are three official ways of doing this. First of all, we have the .drop method for DataFrames:

customers.drop(columns='id')

The next method involves using the del keyword:

del customers['id']
customers.head()

And finally we have the .pop method:

customers.pop('id')
customers.head()

These three methods all have the same effect, so it’s up to you which one you use!

Rename a Column

The method we use to rename a column is .rename . We can rename one or more columns using the following syntax: my_dataframe.rename(columns={'old name': 'new name'})

So, here’s how we could rename our identifier column to ide :

customers.rename(columns={'identifier': 'ide'})

And, of course, we can also rename several columns at a time, for example, we can change email to e-mail :

customers.rename(columns={'identifier': 'ide', 'email': 'e-mail'})

Change the Column Data Type

The .astype method is used to change the column data type. For example, if we wanted to change the identifier column, which originally contained integers, to a float column type, we could do the following:

customers['identifier'] = customers['identifier'].astype(float)

All you need to do is provide the new type within the parentheses!

Sort a DataFrame

In data analysis, we often need to sort data by one or more columns. Pandas provides the .sort_values method, which makes it really easy. You just need to provide the names of the columns you want to use to sort the data. Here are some examples:

# sort by identifier in ascending order:
customers.sort_values('identifier')

# sort by identifier in descending order:
customers.sort_values('identifier', ascending = False)

# sort by gender then last name in ascending order:
customers.sort_values(['gender', 'name'])

The ascending keyword lets us specify whether we want to sort in ascending or descending order.

But what if we want one column in ascending order and the other in descending order?

In this case, we need to pass a list of booleans as a parameter to the ascending keyword. Here’s an example of how we could sort by gender in ascending order and name in descending order:

customers.sort_values(['gender', 'name'], ascending=[True, False])

So, now you’re up to speed on these tools, how about putting it all into practice?

Over to You!

Background

We’re now going to work on our file of real estate loans.

Quick synopsis of the file: each row represents a loan that has been granted to one of our customers. Each customer is uniquely identified using (yes, you’ve guessed it) an identifier! We have the following details:

The city and ZIP code of the branch where the customer arranged the loan
The customer’s monthly income
The customer’s monthly repayments
The agreed loan term, expressed as a number of months
The loan type
The interest rate

Guidelines

This time, your role is to modify the dataset so you can calculate the variables needed to identify customers who are approaching the limits of their repayment capacity, and determine the bank’s profits.

Here’s an exercise so you can practice manipulating a DataFrame.

Check Your Work

Why don’t you compare your answers with the exercise solution?

Let’s Recap

We can select one or more columns in a DataFrame using the my_dataframe[col] syntax, where col is either the name of the column we want to select (if only one column is specified), or a list of column names (if we want to select many).
A DataFrame column is known as a Series in Pandas.
There are many possible operations we can perform with a Series. For example, we can:
- update, add or delete a column.
- change the column name using the .rename method.
- change the column data type using the .astype method.
- sort a DataFrame using the .sort_values() method.

Now, let’s see how to filter the rows so that we have full control of our DataFrame.