• 8 heures
  • Moyenne

Ce cours est visible gratuitement en ligne.

course.header.alt.is_certifying

J'ai tout compris !

Mis à jour le 22/01/2024

Manipulate a DataFrame

Chapter banner

We’ve imported our first DataFrame, but it certainly won’t be our last! Once we’ve successfully imported our data, the next logical step would be to start manipulating it to suit our purposes.

Here’s a real-life example. We want to access a list of email addresses for all customers who have taken out a loan with us. We’ll use the  customers  DataFrame we imported in the last chapter to create an email marketing list for our special offers. How can we do this using Pandas? Well, that’s what we’re about to find out!

Navigate the DataFrame

Each column in a DataFrame has a distinct name, which makes it easier to understand and lets you access a particular column using its name.

To access a column in a DataFrame, you just need to use the dataframe_name[column_name]  syntax  .

So, here’s how you would access a list of email addresses from the  customers  DataFrame we imported in the previous chapter:

customers['email']

Here we’re accessing the  email  variable in our  customers  DataFrame. The syntax makes it clear to see what we’re trying to access.

We might even decide to store the email variable data in a different variable within our code to help us with our mailing list mentioned above. Here, I’ve created an  email  variable, which I’ll use to store all the email addresses from my customer database:

email = customers['email']

But what if we want to access more than one column?

Well, all you have to do is store the set of columns you want to access in a list. Here’s an example of how we can select the name and email address for our customers:

variables = ['name', 'email']
customers[variables]

I know there are only two lines of code here, but let’s just take a closer look:

  1. We want to access the name and email address for all of our customers. So, the first thing we do is create a list called  variables  , which we’ll use to store the set of variable names.

  2. We use the same syntax as before to select the list of variables defined within our DataFrame.

You’ll see that our set of names and email addresses are displayed correctly on the screen.

It is, of course, possible to perform these two steps in a single command:  customers[['name', 'email]]  .

You’ll notice that the variables display in the order we specified (in this case, name and then email) rather than the order in which they were originally provided (email is before name in our  customers  DataFrame). If you check the initial DataFrame, you’ll see that the original sequence has been retained. So, the selection process allows us to resequence the columns however we like, without changing the original DataFrame.

Discover the Pandas Series Object

Before we go any further, I’d like to ask you a simple question. In your opinion, what objects have we been working with so far in this chapter? You’re starting to understand the DataFrame object, but what about a column within a DataFrame? That’s not a DataFrame. Well, it’s time to unveil the second Pandas object that you’re going to use a great deal and it’s one that’s closely linked to a DataFrame. The Series

Each column in your DataFrame is in fact a  Series  object type. You can check this yourself using the  type  function:

Results returned by the type function
Results returned by the type function

We need to ensure that we understand the difference between these objects. Even though they have many methods in common, there are some which are only used by one or the other and this can lead to errors when carrying out data analysis in Python.

Let’s take a look at the following display:

Example of a customer DataFrame consisting of 150 rows by 2 columns with 2 headers, which are  name and email.
DataFrame example

We have quite a pleasing display, structured as a table, where each column has a distinctly defined name. This is a DataFrame. The information shown at the bottom includes the number of rows and columns. It’s actually pretty logical, because we’ve selected two columns here. Given that there are at least two columns, it can’t be a Series.

Let’s now take a look at this:

Series example consisting of 150 rows by 2 columns
Series example

You can see that the display is a bit more basic and that the information displayed below is not the same. This is a Series

The information provided includes the name of the Series (which matches the name displayed at the top of the column when it’s part of a DataFrame), the length (or number of rows) and the column’s data type (integer, string, etc.).

A Series can only contain a single data type, whereas a DataFrame (essentially, a set of Series) can contain columns of different types, such as a column of integers, a column of floating decimal numbers, etc.

We need to keep these ideas in mind when carrying out data analysis. There are numerous methods that are common to these two objects, whereas others apply to specific objects only. For example, all of the attributes we saw in the previous chapter (i.e.,  shape  ,  head  and  dtypes  ) can apply to Series. But some of the methods we’ll see later on in this course are exclusive to DataFrames due to their multidimensional nature.

Let’s now take a look at some of the ways we can work with a Series.

Work With Columns

In this section, we’re going to have a look at some of the key operations we use with DataFrames:

  • Creating or deleting a column

  • Renaming a column

  • Changing the column data type

  • Sorting a DataFrame based on one or more columns

Change an Existing Column

Before we see how to create a column, let’s first look at how to change one that already exists.

Using the syntax we saw before, we have  dataframe_name[column_name]  . This code enables us to access a column in a DataFrame without changing it. But what you don’t yet know is that it also enables you to change an existing DataFrame.

Let’s consider variable  a  that exists in our code:

  • We can access it using its name and apply various different functions to it (e.g.,  print(a)  to display it) without changing its value.

  • But we can also define or change its value (e.g.,  a=4  , which will store the integer 4 in our variable  a  ).

We can do exactly the same thing with a column in a DataFrame:

  • We can access it and apply various functions or methods to it, which is what we’ve done up to now.

  • We can also change the values inside the column.

For example, if I run the following:

customers['name'] = 1

This will do two things. Firstly, it will update the existing  name  variable, replacing all values with 1, and it will also transform the variable’s data type, as you can see if you look at the contents:

Transforming an “object” type column to “integer” type : names become 1s
Transforming an “object” type column to “integer” type

name  was originally an  object  type column, but now it’s an  integer  type, because 1 is an integer.

So, if we replace a column using a fixed value (as we did here), this will replace all values in our column with this fixed value and will change the type of the column to match the type of the specified fixed value, whether or not it’s numeric.

We can also replace a column with an object that has the same dimensions. What we mean by this is a list, an array or a series containing exactly the same number of elements. For example, if I wanted to change the  identifier  column by multiplying all values by 100, I could do it as follows:

customers['identifier'] = customers['identifier']*100

I might also decide to replace it with random values between 1 and 1000:

customers['identifier'] = np.random.randint(1, 1000, customers.shape[0])

Okay, I get it! But now I want to go back to the original values, because my  name  column is no longer particularly meaningful!

Unfortunately, Pandas doesn’t keep track of changes made to DataFrames, so it’s not possible to get our previous values back. But don’t panic! If you’ve run these lines of code yourself, you just need to reload the data using the command  pd.read_csv  to restore your data to the original values.

Create and Delete a Column

Now, how do we create a column? Well, in the same way that we change one! We refer to the column as if it already exists and assign it a value. Not convinced? Let me show you an example:

customers['id'] = customers['identifier'] + 1000
customers.head()

If you run these lines of code yourself, you’ll see that even though the  id  column doesn’t exist in your DataFrame, Pandas will create it and assign the defined value to it (in this case, the value from identifier, plus 1000).

To sum up, whether you’re updating or creating a column (  col  ), the syntax will be:  my_dataframe['col'] = x  , where x is either a fixed value or an object with the same dimensions as the column that we want to update or create.

What about deleting an existing column?

There are three official ways of doing this. First of all, we have the  .drop  method for DataFrames:

customers.drop(columns='id')

The next method involves using the  del  keyword:

del customers['id']
customers.head()

And finally we have the  .pop  method:

customers.pop('id')
customers.head()

These three methods all have the same effect, so it’s up to you which one you use!

Rename a Column

The method we use to rename a column is  .rename  . We can rename one or more columns using the following syntax:  my_dataframe.rename(columns={'old name': 'new name'})

So, here’s how we could rename our  identifier  column to  ide  :

customers.rename(columns={'identifier': 'ide'})

And, of course, we can also rename several columns at a time, for example, we can change  email  to  e-mail  :

customers.rename(columns={'identifier': 'ide', 'email': 'e-mail'})

Change the Column Data Type

The  .astype  method is used to change the column data type. For example, if we wanted to change the  identifier  column, which originally contained integers, to a float column type, we could do the following:

customers['identifier'] = customers['identifier'].astype(float)

All you need to do is provide the new type within the parentheses!

Sort a DataFrame

In data analysis, we often need to sort data by one or more columns. Pandas provides the  .sort_values  method, which makes it really easy. You just need to provide the names of the columns you want to use to sort the data. Here are some examples:

# sort by identifier in ascending order:
customers.sort_values('identifier')
# sort by identifier in descending order:
customers.sort_values('identifier', ascending = False)
# sort by gender then last name in ascending order:
customers.sort_values(['gender', 'name'])

The  ascending  keyword lets us specify whether we want to sort in ascending or descending order.

But what if we want one column in ascending order and the other in descending order?

In this case, we need to pass a list of booleans as a parameter to the  ascending  keyword. Here’s an example of how we could sort by gender in ascending order and name in descending order:

customers.sort_values(['gender', 'name'], ascending=[True, False])

So, now you’re up to speed on these tools, how about putting it all into practice?

Over to You!

Over to you exercise banner

Background

We’re now going to work on our file of real estate loans.

Quick synopsis of the file: each row represents a loan that has been granted to one of our customers. Each customer is uniquely identified using (yes, you’ve guessed it) an identifier! We have the following details:

  • The city and ZIP code of the branch where the customer arranged the loan

  • The customer’s monthly income

  • The customer’s monthly repayments

  • The agreed loan term, expressed as a number of months

  • The loan type

  • The interest rate

Guidelines

This time, your role is to modify the dataset so you can calculate the variables needed to identify customers who are approaching the limits of their repayment capacity, and determine the bank’s profits.

Here’s an exercise so you can practice manipulating a DataFrame.

Check Your Work

Why don’t you compare your answers with the exercise solution?

Let’s Recap

  • We can select one or more columns in a DataFrame using the  my_dataframe[col]  syntax, where  col  is either the name of the column we want to select (if only one column is specified), or a list of column names (if we want to select many).

  • A DataFrame column is known as a Series in Pandas.

  • There are many possible operations we can perform with a Series. For example, we can:

    • update, add or delete a column.

    • change the column name using the  .rename  method.

    • change the column data type using the  .astype  method.

    • sort a DataFrame using the  .sort_values()  method.

Now, let’s see how to filter the rows so that we have full control of our DataFrame.

Exemple de certificat de réussite
Exemple de certificat de réussite