Discover NumPy
Data analysts and data scientists need to manipulate a high volume of data from different sources in their day-to-day work. This data generally comes in the form of lists or tables of numbers.
However, the native objects available in Python are fairly limited in their ability to handle this data correctly.
Let’s look at the following list, which shows the monthly income for five of our bank’s customers:
income = [1800, 1500, 2200, 3000, 2172]
If we want to calculate the average, Python has no built-in function for it (short of importing the standard statistics module), so we’d typically code it ourselves! It gets even more complicated if we want something more specific, such as the median.
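To give a sense of the extra work involved, here is a rough sketch of what a hand-written median could look like. The helper name median is ours, chosen for illustration; it isn’t something the course provides.

```python
# Hand-written median for a list of numbers (illustrative helper, not part of the course)
income = [1800, 1500, 2200, 3000, 2172]

def median(values):
    ordered = sorted(values)   # work on a sorted copy, leaving the input untouched
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # odd number of values: the median is the middle one
        return ordered[middle]
    # even number of values: average the two middle ones
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median(income))  # 2172
```

Even for this simple statistic, we already have to think about sorting and about the odd and even cases.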
This is where the NumPy library comes in!
The name is derived from Numerical Python and the library provides a set of functions to efficiently perform operations on data. It also provides an object that is at the heart of almost all Python data ecosystems: the array.
In this chapter, I’m going to give you more details about this library and why it’s become a bit of a legend in the Python world.
Compare Arrays With Standard Lists
NumPy Arrays
NumPy arrays are quite similar to Python lists, but with some major advantages. They enable you to carry out many complex operations quickly and easily and they also help you to store and manipulate data.
Let’s take the above example where we wanted to calculate the average. This isn’t a complicated function at all. You could write:
def average(values):
    return sum(values) / len(values)

average(income)  # => 2134.4
But NumPy gives you direct access to an existing function that does this for you.
Let’s start by importing the NumPy library into our notebook:
import numpy as np
Now calculate the average:
np.mean(income)
Well, that looks pretty good. But it’s hardly revolutionary!
Yes, I can see why you’d say that, but the real revolution comes with NumPy arrays. As the size of the data increases, NumPy array functions pull further ahead of the equivalent code on a normal Python list. A figure of around 30 times faster is often quoted, although the exact gain depends on the operation and the machine.
This major difference is largely due to the fact that a NumPy array can only contain one type of object, whereas a standard list can store objects of different types. Because every element has the same type, NumPy can store the values compactly and process them in optimized compiled code.
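If you’d like to check the speed difference for yourself, the sketch below times the same average on a list and on an array using the standard timeit module. The variable names and the one-million-element size are arbitrary choices for illustration, and the timings you see will vary from one machine to another.

```python
import timeit
import numpy as np

# One million illustrative values, stored as a list and as an array
values_list = list(range(1_000_000))
values_array = np.arange(1_000_000)

# Time 10 runs of each approach
list_time = timeit.timeit(lambda: sum(values_list) / len(values_list), number=10)
array_time = timeit.timeit(lambda: values_array.mean(), number=10)

print(f"list:  {list_time:.4f} s")
print(f"array: {array_time:.4f} s")
```

Both approaches return the same average; only the time taken to get there differs.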
On top of this, NumPy also provides many other essential math functions that can be applied to arrays or lists:
x = [-2, -1, 1, 2]
print("Absolute value: ", np.abs(x))
print("Exponential: ", np.exp(x))
print("Logarithm: ", np.log(np.abs(x)))
There are many others available, which you can explore in the NumPy documentation. So, let’s create our first array.
Create a NumPy Array
You can do this in a number of different ways, but the simplest is to start from a standard Python list:
income_array = np.array(income)
income_array
# array([1800, 1500, 2200, 3000, 2172])
There are also NumPy functions that allow you to create arrays using a particular pattern or specification. The most commonly used ones are:
np.zeros(n): creates an array of zeros consisting of n elements.
np.ones(n): similar to the above function, this one creates an array of ones.
np.arange(i, j, p): creates an array containing a linear sequence from i up to (but not including) j in steps of p.
np.linspace(i, j, n): creates an array of n evenly spaced values between i and j, both included by default.
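Here is a short sketch of these four functions in action. The variable names and arguments are arbitrary, picked only to keep the results easy to read.

```python
import numpy as np

zeros = np.zeros(3)             # array([0., 0., 0.])
ones = np.ones(3)               # array([1., 1., 1.])
steps = np.arange(0, 10, 2)     # array([0, 2, 4, 6, 8]); the end value 10 is excluded
spaced = np.linspace(0, 1, 5)   # array([0.  , 0.25, 0.5 , 0.75, 1.  ]); the end value 1 is included

print(zeros, ones, steps, spaced)
```

Note that np.zeros and np.ones produce floating-point values by default, which is why the results are shown with a decimal point.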
Create a Monotype Array
As we’ve seen previously, arrays can only contain a single data type. You can access the data type of an array’s elements using the .dtype attribute:
income_array.dtype
# dtype('int32') (or dtype('int64'), depending on your platform and NumPy version)
For more information, you can refer to the official NumPy documentation on data types.
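To see the single-type rule at work, the sketch below builds an array from a mix of integers and a decimal number, then sets and converts the type explicitly. The values and variable names are made up for illustration.

```python
import numpy as np

# Integers and a float together: NumPy picks one type that can hold them all
mixed = np.array([1800, 1500.5, 2200])
print(mixed.dtype)   # float64: the integers were converted to floats

# You can request a type when creating the array...
as_float = np.array([1800, 1500, 2200], dtype=np.float64)

# ...or convert an existing array; converting to integers drops the decimals
as_int = mixed.astype(np.int64)
print(as_int)        # [1800 1500 2200]
```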
Now let’s see how to select elements from a NumPy array.
Select Elements From an Array
So, how do we access elements in an array? Just like a standard list, you can use the array’s numeric index to access the elements.
Access a Single Element
Accessing a single element in an array can be done using the syntax array_name[index]:
# to access the fifth element
income_array[4]
# to access the last element
income_array[-1]
# You can also modify values:
income_array[1] = 1900
Access Many Sequential Elements
We can access a set of contiguous elements by combining [] with :. The syntax follows a simple rule: array_name[i:j:p], where:
i is the index of the starting element.
j is the index where the selection stops.
p is the step.
This will select elements from index i to index j - 1, because the index j is excluded. The line of code:
income_array[0:3]
takes our income_array and selects all elements from 0 (i.e., the first element) to 2 (the third element). The element with index 3 is not included! If you wanted to select it, you’d need to add 1 to the final index: income_array[0:4].
# First 3 elements
print(income_array[:3])
# Elements starting from index no. 2
print(income_array[2:])
# Every other element
print(income_array[::2])
If the step value is negative, the array is traversed backwards, so the roles of the start and end of the selection are reversed. You can use this to easily reverse an array.
income_array[::-1]
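The sketch below puts the i:j:p rule to work on a few slices. The array is rebuilt here so that the example runs on its own, using the values from this chapter after the second element was changed to 1900.

```python
import numpy as np

income_array = np.array([1800, 1900, 2200, 3000, 2172])

print(income_array[1:4])     # [1900 2200 3000]  indices 1, 2 and 3
print(income_array[0:5:2])   # [1800 2200 2172]  every other element
print(income_array[::-1])    # [2172 3000 2200 1900 1800]  the whole array, reversed
print(income_array[3:0:-1])  # [3000 2200 1900]  from index 3 down to index 1
```

In the last line, the end index 0 is still excluded, even though we are moving backwards.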
Access Elements Based on a Condition
We might want to go further in accessing our different array elements. When performing data analysis, you often need to select some of the data based on criteria (e.g., a person’s gender) or conditions (all people below a certain age). You can do this using NumPy arrays.
In the same way that you can select elements using their index, with NumPy arrays, you can provide conditions for selecting array elements using array_name[condition].
Here’s an example showing how to only select elements with a value greater than $2,000:
income_array[income_array > 2000]
# array([2200, 3000, 2172])
Let’s see how this selection process works in a little more detail:
First of all, we specify the name of the array before the square brackets ( [ ] ) to indicate which array we want to select from.
Then, we specify the condition within the square brackets. The result of the condition specified within the square brackets must be an array or list of boolean values. In our case, if we run the condition income_array > 2000, you’ll see that we end up with an array containing boolean values, showing True where the condition is met, and False if not.
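The two steps described above can be made visible by storing the condition in a variable first. The name mask is our own choice for illustration, and the array is rebuilt so the example is self-contained.

```python
import numpy as np

income_array = np.array([1800, 1900, 2200, 3000, 2172])

# Step 1: evaluating the condition gives one boolean per element
mask = income_array > 2000
print(mask)                 # [False False  True  True  True]

# Step 2: the boolean array keeps only the elements marked True
print(income_array[mask])   # [2200 3000 2172]

# Bonus: True counts as 1, so summing the mask counts the matches
print(mask.sum())           # 3
```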
This can soon become complicated when you have multiple conditions:
income_array[(income_array > 2000) & (income_array < 3000)]
# array([2200, 2172])
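A few points are worth spelling out when combining conditions: each condition goes in its own parentheses, and arrays are combined with the operators & (and), | (or) and ~ (not) rather than the Python keywords and, or and not, which raise an error on arrays. A minimal sketch, with variable names chosen for illustration:

```python
import numpy as np

income_array = np.array([1800, 1900, 2200, 3000, 2172])

# and: strictly between 2000 and 3000
between = income_array[(income_array > 2000) & (income_array < 3000)]
print(between)    # [2200 2172]

# or: below 1900, or at least 3000
outside = income_array[(income_array < 1900) | (income_array >= 3000)]
print(outside)    # [1800 3000]

# not: everything that is NOT above 2000
not_high = income_array[~(income_array > 2000)]
print(not_high)   # [1800 1900]
```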
Note that this syntax is used not only to select elements, but also to update elements within an array. For example:
income_array[3] = 1790 replaces the fourth element in our array with the value 1790.
income_array[income_array > 2000] = 0 replaces all values in our array that are greater than 2,000 with the value 0.
You need to be really careful when you’re using this syntax, or you might end up replacing values that you didn’t want to replace.
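One way to limit the risk of overwriting data by accident is to work on a copy. The sketch below is only a suggestion: the name capped and the ceiling of 2,000 are made up for the example.

```python
import numpy as np

income_array = np.array([1800, 1900, 2200, 3000, 2172])

capped = income_array.copy()    # independent copy: the original stays untouched
capped[capped > 2000] = 2000    # replace only the values that meet the condition

print(capped)         # [1800 1900 2000 2000 2000]
print(income_array)   # [1800 1900 2200 3000 2172]
```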
Now let’s have a look at some of the many methods we can use with arrays.
Use Array Methods
Up to now, we’ve seen that we can obtain an array’s data type using the .dtype attribute. In the same way, we can easily access the array dimensions using the .shape attribute:
income_array.shape
# (5,) we have an array containing 5 elements
The result tells us that we have an array containing five elements (we’ll see later why there’s a comma after the 5).
We can also apply a whole host of mathematical operations really easily:
# calculate the mean
income_array.mean()
# calculate the maximum (or minimum)
income_array.max()
income_array.min()
# return the index number for the minimum (or maximum) element
income_array.argmin()
income_array.argmax()
# sort into ascending order (this modifies the array itself):
income_array.sort()
print(income_array)
# calculate the sum:
income_array.sum()
This is, of course, a non-exhaustive list! You’ll find a full list of methods that can be applied to arrays in the official documentation.
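One detail to keep in mind is that the .sort() method reorders the array itself, whereas the np.sort() function returns a sorted copy. The sketch below shows the difference alongside a couple of the methods above; the array is rebuilt so the example runs on its own.

```python
import numpy as np

income_array = np.array([1800, 1900, 2200, 3000, 2172])

print(income_array.mean())     # 2214.4
print(income_array.argmax())   # 3, the index of the largest value (3000)

# np.sort returns a sorted copy and leaves the original as it was
sorted_copy = np.sort(income_array)
print(sorted_copy)             # [1800 1900 2172 2200 3000]
print(income_array)            # [1800 1900 2200 3000 2172]

# The .sort() method sorts the array in place and returns None
income_array.sort()
print(income_array)            # [1800 1900 2172 2200 3000]
```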
Over to You!
Background
Throughout this course, you’ll be playing the role of someone who works in the data analysis department at a bank. More specifically, you work in the loans department. Your aim is to use your new-found Python libraries knowledge to help the team with various tasks.
To do this, you can use the Google Colaboratory notebooks created especially for this purpose.
Guidelines
For your first task, we’ve provided you with income figures for 10 of our bank’s customers. You’ll use the various data manipulation techniques in this chapter to select customer income based on specific conditions and perform certain operations.
Come join me over at this link to see more details about the exercise.
Check Your Work
Once you’ve finished this one, you can check your work against the solution.
Let’s Recap
NumPy (which stands for Numerical Python) is a Python library that enables you to work with numeric data and perform numerous mathematical operations on a table of data.
The data is stored in a structure similar to a Python list, called a NumPy array.
An array, unlike a list, is monotype—i.e., it can only contain data of one type.
In an array, you can select:
an element by index number, using the following notation: array_name[index].
several sequential elements, using the following notation: array_name[start:end:step].
specific elements that meet a condition: array_name[condition].
Arrays have many different methods that enable you to manipulate data or perform mathematical operations very easily.
Now let’s take a deep dive into data manipulation using NumPy, including multidimensional arrays!