Numerical Data with Numpy Arrays

We left off by introducing a couple tools from Scipy and Matplotlib for basic data processing and visualization.

In this case, we had a very small and simple data set - roughly 100 differents numbers - so performance wasn't an issue. However, as the data sets grow, the overhead of using Python compared to a compiled language like C++ starts to show.

For example, if we wanted to do some image processing for a desktop wallpaper image (say 1280 x 1024 pixels large), just to access every pixel is quite a few steps. If we wanted to do something like apply a Gaussian blur on a small neighborhood of each pixel using a convolution, we've just increased the number of steps by an order of magnitude!

Another example is, if we try time-stepping a hyperbolic PDE like an advection equation using a finite difference method using a fine grid, we need to do a potentially large number of operations per step, so we'd like to make this as fast as possible to get any sort of long term evolution.

To overcome this, we're going to introduce Numpy arrays as a way to get close to compiled level performance while still bring able to use a high-level description.

We'll start out getting comfortable with the basic construction, manipulation and operations of arrays.

Basic Array Construction

Let's start with a simple example and define a couple Numpy arrays. This is going to seem similar to lists at first.

In [70]:
import numpy as np

A = np.array([1., 2., 3., 4.])
B = np.array([2., 5., 1., 3.])

print A
print B
[ 1.  2.  3.  4.]
[ 2.  5.  1.  3.]

If I want to get fancier, I can even define multidimensional arrays! For example, let's define a 2d array:

In [118]:
A = np.array([
    [1., 2., 3.],
    [4., 5., 6.],
])

print A
[[ 1.  2.  3.]
 [ 4.  5.  6.]]

The way Numpy refers to these dimensions is via the shape attribute. If we want to check the dimensions of an array, we just use:

In [237]:
A.shape
Out[237]:
(2, 3)

As you can imagine, it would be tedious to design arrays by hand like this. Luckily, there are a number of convinience methods for constructing commonly used arrays. Let's start out with the analogue of the familiar range function.

In [122]:
n = np.arange(1, 10)
n
Out[122]:
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
In [123]:
n = np.arange(1, 10, 2)
n
Out[123]:
array([1, 3, 5, 7, 9])

Another very common variation on this is doing a uniform subdivision of an interval:

In [224]:
x = np.linspace(0, 1, 10)
x
Out[224]:
array([ 0.        ,  0.11111111,  0.22222222,  0.33333333,  0.44444444,
        0.55555556,  0.66666667,  0.77777778,  0.88888889,  1.        ])

This is frequently used to compute a bunch of values over an interval. For example:

In [225]:
x = np.linspace(-1, 1)
y = x**3 - x/2.0

In order to help see what's happening, let's verify that this looks right by plotting it:

In [226]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure()
plt.plot(x, y)
plt.show()

Indexing and Slicing

Just like we saw with Python lists, we can access and slice elements in arrays. This time, we have more freedom since arrays can be multidimensional. As a starting example, let's try filling in some different regions of a zero array.

In [168]:
Z = np.zeros((10, 10))
Z[2, 1] = 1.0
Z
Out[168]:
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
In [169]:
Z = np.zeros((10, 10))
Z[3, :] = 1.0
Z
Out[169]:
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
In [170]:
Z = np.zeros((10, 10))
Z[:, 3] = 1.0
Z
Out[170]:
array([[ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.]])
In [171]:
Z = np.zeros((10, 10))
Z[4, :] += 1.0
Z[:, 3] += 1.0
Z
Out[171]:
array([[ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  2.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.]])

Just like with lists, I can also take a slice from an array. This time slices can be combined per dimension.

In [172]:
Z = np.zeros((10, 10))
Z[3:7, 5:8] = 1.0
Z
Out[172]:
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

Finally, striding works just like before. For example, if I want an alternating sequence of 1 and -1, I can use:

In [173]:
x = np.empty(12)
x[::2] = 1.0
x[1::2] = -1.0
x
Out[173]:
array([ 1., -1.,  1., -1.,  1., -1.,  1., -1.,  1., -1.,  1., -1.])

Exercise 1

  1. Create a uniform subdivision of the interval -1.3 to 2.5 with 64 subdivisions.
  2. Generate an array of length $3n$ filled with the cyclic pattern 1, 2, 3.
  3. Create an array of the first 10 odd integers.
  4. Create a 10 x 10 arrays of zeros and then "frame" it with a border of ones.
  5. Create an 8 x 8 array with a checkerboard pattern of zeros and ones using a slicing+striding approach.

Array Operations

We've already hinted at some basic arthimetic operations using linspace. It turns out that these work generally for arrays. We'll also see that many standard functions also have "vectorized" versions which work with arrays.

In [227]:
A = np.array([
    [1., 2., 3.],
    [4., 5., 6.],
])

B = np.array([
    [1., 5., 2.],
    [6., 4., 2.],
])
In [228]:
A + B
Out[228]:
array([[  2.,   7.,   5.],
       [ 10.,   9.,   8.]])
In [229]:
A - B
Out[229]:
array([[ 0., -3.,  1.],
       [-2.,  1.,  4.]])
In [230]:
A * B
Out[230]:
array([[  1.,  10.,   6.],
       [ 24.,  20.,  12.]])

Notice that this is entrywise multiplication! Perhaps this isn't what you were expecting? In order to do matrix-style multiplication, we instead use the dot function. This works for both matrix-vector and matrix-matrix multiplication.

In [243]:
A = np.array([
    [1., 4., 2.],
    [5., 2., 1.],
    [6., 2., 1.],
])

x = np.array([1., 2., 3.])

np.dot(A, x)
Out[243]:
array([ 15.,  12.,  13.])
In [244]:
B = np.array([
    [1., 4., 1.],
    [2., 1., 5.],
    [3., 3., 9.],
])

np.dot(A, B)
Out[244]:
array([[ 15.,  14.,  39.],
       [ 12.,  25.,  24.],
       [ 13.,  29.,  25.]])

It turns out that arrays are also a little "smarter" than just this, and support a feature called broadcasting which automatically allows certain arrays of different shapes to be "stretched" to fit another shape.

For very sophisticated uses, this can be tricky to get right so we won't get into it in-depth here. However, it is useful in simple cases which we'll look at here:

In [ ]:
A + 3.0 # added to each entry
In [180]:
v = np.array([1., 4., 3.])
A + v # added to each row
Out[180]:
array([[ 2.,  6.,  6.],
       [ 5.,  9.,  9.]])
In [184]:
v = np.array([1., 2., 3.])
w = np.array([10., 20., 30.])
v[:, np.newaxis] + w[np.newaxis, :] # all pairs sum
Out[184]:
array([[ 11.,  21.,  31.],
       [ 12.,  22.,  32.],
       [ 13.,  23.,  33.]])

As mentioned before, many common functions are availible in "vectorized" form.

In [186]:
x = np.linspace(-5, 5, 100)
y = np.sin(2*np.pi*x) * x
print y
[ -6.12323400e-15  -2.90464996e+00  -4.58160166e+00  -4.43864021e+00
  -2.60618422e+00   1.42615459e-01   2.71615312e+00   4.13770624e+00
   3.91587234e+00   2.21171244e+00  -2.53055033e-01  -2.49972959e+00
  -3.68110443e+00  -3.40059917e+00  -1.84197448e+00   3.31255908e-01
   2.25623854e+00   3.21324035e+00   2.89428362e+00   1.49787883e+00
  -3.77219937e-01  -1.98659109e+00  -2.73557709e+00  -2.39836717e+00
  -1.18028043e+00   3.91013556e-01   1.69174879e+00   2.24959419e+00
   1.91426741e+00   8.89979363e-01  -3.72767603e-01  -1.37272188e+00
  -1.75678516e+00  -1.44337567e+00  -6.27719525e-01   3.22677043e-01
   1.03056760e+00   1.25865502e+00   9.87054792e-01   3.94187453e-01
  -2.41000594e-01  -6.66388378e-01  -7.56717681e-01  -5.46636773e-01
  -1.90011191e-01   1.28060253e-01   2.81329944e-01   2.52493467e-01
   1.23420599e-01   1.57592649e-02   1.57592649e-02   1.23420599e-01
   2.52493467e-01   2.81329944e-01   1.28060253e-01  -1.90011191e-01
  -5.46636773e-01  -7.56717681e-01  -6.66388378e-01  -2.41000594e-01
   3.94187453e-01   9.87054792e-01   1.25865502e+00   1.03056760e+00
   3.22677043e-01  -6.27719525e-01  -1.44337567e+00  -1.75678516e+00
  -1.37272188e+00  -3.72767603e-01   8.89979363e-01   1.91426741e+00
   2.24959419e+00   1.69174879e+00   3.91013556e-01  -1.18028043e+00
  -2.39836717e+00  -2.73557709e+00  -1.98659109e+00  -3.77219937e-01
   1.49787883e+00   2.89428362e+00   3.21324035e+00   2.25623854e+00
   3.31255908e-01  -1.84197448e+00  -3.40059917e+00  -3.68110443e+00
  -2.49972959e+00  -2.53055033e-01   2.21171244e+00   3.91587234e+00
   4.13770624e+00   2.71615312e+00   1.42615459e-01  -2.60618422e+00
  -4.43864021e+00  -4.58160166e+00  -2.90464996e+00  -6.12323400e-15]

You can see this computes sin at each x value. Although we could do the same using a list, this approach is much faster (and arguably simpler once you get used to it). To verify this, let's plot the function:

In [188]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure()
plt.plot(x, y)
plt.show()

Exercise 2

  1. Try using the dot function on a vector-vector, matrix-vector and matrix-matrix example. (This may seem simple but it's good to see how the results differ in each case.)
  2. Create a plot of $x^2 \cdot \sin(1/x^2) + x$ on the interval $[-1, 1]$ using 250 points. Remember to label the axes!
  3. Create a semilogy plot of the relative difference of $1 / (1+x^2)$ and $1/x^2$ on the interval $[5, 25]$. (The relative difference of numbers $a$ and $b$ is given by $|1-a/b|$. It provides a better sense of error relative to the order of magnitudes of $a$ and $b$.)
  4. It was mentioned that many common functions are availible in vectorized form. It turns out that Scipy also has many less common, special functions. Take a look at the extensive list here! Try looking for some interesting ones you recognize (or maybe don't recognize!) and either plug in a few numbers or plot them.

Working with Higher Dimensional Arrays

This is nice but Numpy's power really starts to show on larger data sets. Let's take a step from 1d to 2d computations. As a starting example, we'd like to evaluating and plot the function $f(x, y) = x^2 - y^2$ on the domain $[-2, 2] \times [-2, 2]$.

To do this, we need the analogous of linspace in higher dimensions. We'll form such a thing using a new function called meshgrid.

In [238]:
x, y = np.meshgrid(np.linspace(-2, 2, 100),
                   np.linspace(-2, 2, 100))

f = x**2 - y**2

How do we plot this? We'll introduce a new plotting function called pcolormesh.

In [239]:
plt.figure()
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.pcolormesh(x, y, f)
plt.show()

This is a good start! However, something about this is asthetically unsatisfying...it doesn't do a good job highlighting an important aspect of the function...the zero set. Let's see how we can make a quick change to improve this.

In [240]:
plt.figure()
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.pcolormesh(x, y, f, cmap='RdBu')
plt.show()

As an aside, Matplotlib supports a number of different colormaps. Some of these have particular purposes. There are three general classes called sequential, diverging and qualitative. In this case, I'm using the fact the diverging colormaps are good at highlighting a distinct central value in a dataset. If you're interested, you can learn more about this here!

Exercise 3

  1. Create a color plot of $\sin(x) \sin(y)$ on the interval $[-\pi, \pi] \times [-\pi, \pi]$.
  2. Create a function which creates an $n \times n$ array with $(i,j)$-entry equal to $i+j$.

Manipulating Arrays

When working with arrays, it's not uncommon to have to spend some time getting things into the right "shape" or format that you want. Numpy provides a some functions which help with this sort of thing. The first two we'll look at have to do with reshaping an array.

In [255]:
A = np.arange(1, 10)
print A
print A.shape
[1 2 3 4 5 6 7 8 9]
(9,)
In [256]:
B = A.reshape(3, 3)
print B
print B.shape
[[1 2 3]
 [4 5 6]
 [7 8 9]]
(3, 3)

You can see that we've taken our array of 9 elements and "wrapped" it into a 3-by-3 array. To undo this, we could either use reshape again, or we can use the more common ravel:

In [259]:
print np.ravel(B)
print B.ravel() # equivalent but a little shorter.
[1 2 3 4 5 6 7 8 9]
[1 2 3 4 5 6 7 8 9]

It's also common to have a few different arrays that you want to merge or "stack" them together in some way or another. Let's look at how to stack arrays "vertically", "horizontally" and "column-wise".

In [264]:
A = np.array([1., 2., 3.])
B = np.array([4., 1., 2.])
In [265]:
np.vstack([A, B])
Out[265]:
array([[ 1.,  2.,  3.],
       [ 4.,  1.,  2.]])
In [266]:
np.hstack([A, B])
Out[266]:
array([ 1.,  2.,  3.,  4.,  1.,  2.])
In [267]:
np.column_stack([A, B])
Out[267]:
array([[ 1.,  4.],
       [ 2.,  1.],
       [ 3.,  2.]])

More Array Operations

Aside from arthimetic-type operations, Numpy also provides many other things like finding the minimum or maximum value in an array, sorting arrays, adding up all the entries of an array and much, much more. Let's go ahead and take a look at some of the basics.

In [277]:
A = np.random.randint(1, 10, 5)
A
Out[277]:
array([7, 8, 4, 1, 8])
In [284]:
print np.min(A)
print np.max(A)
1
8
In [283]:
np.sort(A)
Out[283]:
array([1, 4, 7, 8, 8])

An interesting variation on these is given by the "argument" version. Instead of picking out the min, max or sorting, we can compute the indices which would arise from these.

In [288]:
print np.argmin(A)
print np.argmax(A)
print np.argsort(A)
3
1
[3 2 0 1 4]

Exercise 4

  1. Evaluate $\cos$ and $\sin$ on the interval $[0, 1]$ and then stack the results into a tall array with rows being the $(\cos(x), \sin(x))$ entries.
  2. Create a random $3 \times 5$ array using the np.random.rand(3, 5) function and compute: the sum of all the entries, the sum of the rows and the sum of the columns. (Just like sorted had an optional key= argument, many Numpy functions have an optional axis= argument!)
  3. Create a random $5 \times 5$ array using the function np.random.rand(5, 5). We want to sort the rows according to the second column. Try combining array slicing + argsort + indexing to do this.

More Advanced Indexing

As a closing topic, let's take a second look at how we can index elements in an array. This time, we're going to use an array to index another.

In [291]:
A = np.arange(1, 10)**2
A
Out[291]:
array([ 1,  4,  9, 16, 25, 36, 49, 64, 81])
In [292]:
i = np.array([1, 5, 3, 2, 3, 1])
A[i]
Out[292]:
array([ 4, 36, 16,  9, 16,  4])

I can also do this in multiple dimensions!

In [296]:
A = np.zeros((5, 5))
i = np.arange(5)
A[i, i] = 1.0
A
Out[296]:
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])
In [299]:
A = np.arange(1, 10).reshape(3, 3)
A
Out[299]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

Finally, I can even index using multi-dimensional arrays!

In [303]:
i = np.array([
    [0, 0],
    [1, 1],
])

j = np.array([
    [0, 1],
    [2, 0],
])

A[i, j]
Out[303]:
array([[1, 2],
       [6, 4]])

This takes some care to use correctly, however... Since this is just an introduction, we'll leave this as a look at what's to come.