Data Science: Performing Operations on Arrays with Python

By John Paul Mueller, Luca Massaron

You will need to know how to use arrays for data science. A basic form of data manipulation with Python is to place the data in an array or matrix and then use standard math-based techniques to modify its form.

Using this approach puts the data in a convenient form and is much faster than operations performed on each observation individually, such as in iterations, because it can leverage your computer architecture and some highly optimized numerical linear algebra routines present in CPUs. These routines are callable from every operating system. The larger the data and the computations, the more time you can save. In addition, using these techniques spares you from writing long and complex Python code.

Using vectorization

Your computer provides you with powerful calculation routines, and you can use them when your data is in the right format. NumPy’s ndarray is a multidimensional data storage structure that you can use as a two-dimensional data table. In fact, you can use it as a cube when there are three dimensions, or even as a hypercube when there are more than three.
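As a minimal sketch of these shapes, the following Python 3 snippet builds a two-dimensional table and a three-dimensional cube and inspects them through the ndim and shape attributes. The names table and cube, and the values in them, are illustrative and not part of the original text.

```python
import numpy as np

# A 2-D array: 3 rows (observations) by 4 columns (features).
table = np.arange(12).reshape(3, 4)

# A 3-D array: think of it as 2 stacked tables of 3 x 4 values.
cube = np.arange(24).reshape(2, 3, 4)

# ndim is the number of dimensions; shape is the size along each one.
print(table.ndim, table.shape)
print(cube.ndim, cube.shape)
```

The same reshape() call with four or more sizes would produce the hypercube case.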

Using ndarray makes computations easy and fast. The following example creates a dataset of three observations with seven features for each observation. In this case, the example obtains the maximum value for each observation and subtracts the minimum value from it to obtain the range of values for each observation.

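import numpy as np
# Three observations (rows) with seven features (columns) each.
dataset = np.array([[2, 4, 6, 8, 3, 2, 5],
                    [7, 5, 3, 1, 6, 8, 0],
                    [1, 3, 2, 1, 0, 0, 8]])
# Range of each observation: row maximum minus row minimum.
print(np.max(dataset, axis=1) - np.min(dataset, axis=1))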

The last line obtains the maximum value from each observation using np.max() and then subtracts from it the minimum value obtained using np.min(). The maximum value in each observation is [8 8 8]. The minimum value for each observation is [2 0 0]. As a result, you get the following output:

[6 8 8]
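To see what vectorization saves you, here is a hedged sketch in Python 3 that computes the same ranges twice: once with an explicit loop over the observations and once with the vectorized call. The loop version and the names loop_ranges and vector_ranges are illustrative additions, not part of the original example.

```python
import numpy as np

dataset = np.array([[2, 4, 6, 8, 3, 2, 5],
                    [7, 5, 3, 1, 6, 8, 0],
                    [1, 3, 2, 1, 0, 0, 8]])

# Loop version: handle one observation (row) at a time.
loop_ranges = []
for row in dataset:
    loop_ranges.append(int(row.max() - row.min()))

# Vectorized version: axis=1 reduces across the features of each row.
vector_ranges = np.max(dataset, axis=1) - np.min(dataset, axis=1)

print(loop_ranges)
print(vector_ranges)
```

Both approaches give the same three ranges; the vectorized form is shorter and leaves the looping to NumPy.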

Performing simple arithmetic on vectors and matrices

Most operations and functions from NumPy that you apply to arrays leverage vectorization, so they’re fast and efficient, typically much more so than equivalent handwritten Python loops. Even the simplest operations such as additions or divisions can take advantage of vectorization.

For instance, the form of the data in your dataset often won’t quite match the form you need. A list of numbers could represent percentages as whole numbers when you really need them as fractional values. In this case, you can usually perform some type of simple math to solve the problem, as shown here:

import numpy as np
# Percentages stored as whole numbers.
a = np.array([15.0, 20.0, 22.0, 75.0, 40.0, 35.0])
# Scale every element at once to get fractional values.
a = a * 0.01
print(a)

The example creates an array, fills it with whole number percentages, and then uses 0.01 as a multiplier to create fractional percentages. You can then multiply these fractional values against other numbers to determine how the percentage affects that number. The output from this example is

[ 0.15 0.2 0.22 0.75 0.4 0.35]
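As a sketch of that follow-up step, the Python 3 snippet below applies the fractional values to another number. The base amount of 200 and the names fractions, base_amount, and portions are hypothetical and chosen only for illustration.

```python
import numpy as np

percentages = np.array([15.0, 20.0, 22.0, 75.0, 40.0, 35.0])

# Convert whole-number percentages to fractions in one step.
fractions = percentages * 0.01

# Hypothetical amount to which each percentage is applied.
base_amount = 200.0

# One multiplication yields the share of base_amount for every percentage.
portions = fractions * base_amount
print(portions)
```

Because the values are floating point, the printed results may differ from the exact amounts (30, 40, 44, 150, 80, and 70) by tiny rounding errors.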

Performing matrix vector multiplication

The most efficient vectorization operations are matrix manipulations in which you add and multiply multiple values against other multiple values. NumPy makes performing multiplication of a vector by a matrix easy, which is handy if you have to estimate a value for each observation as a weighted summation of the features. Here’s an example of this technique:

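import numpy as np
# Vector of four values, for example the features of one observation.
a = np.array([2, 4, 6, 8])
# 4 x 4 matrix; each column acts as one set of weights.
b = np.array([[1, 2, 3, 4],
              [2, 3, 4, 5],
              [3, 4, 5, 6],
              [4, 5, 6, 7]])
# Each output value is the weighted sum of a with one column of b.
c = np.dot(a, b)
print(c)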

Notice that the order of the arguments matters in general. With the vector first, NumPy treats it as a row whose length must match the number of rows in the matrix; when the shapes don’t line up, you get an error. The example outputs these values:

[60 80 100 120]

To obtain the values shown, you multiply each value in the array by the matching value in one column of the matrix and then add up the products. The first output value uses the first column, the second output value uses the second column, and so on. For example, the first value in the output is 2 * 1 + 4 * 2 + 6 * 3 + 8 * 4, which equals 60.
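The column-by-column arithmetic can be checked with a short Python 3 sketch that computes each weighted sum by hand and compares it with np.dot(). The name by_hand is an illustrative addition.

```python
import numpy as np

a = np.array([2, 4, 6, 8])
b = np.array([[1, 2, 3, 4],
              [2, 3, 4, 5],
              [3, 4, 5, 6],
              [4, 5, 6, 7]])

# For each column j: multiply a element-wise by that column, then sum.
by_hand = [int(np.sum(a * b[:, j])) for j in range(b.shape[1])]

print(by_hand)
print(np.dot(a, b))
```

Both lines report the same four values, which confirms that np.dot() performs one weighted summation per column.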

Performing matrix multiplication

You can also multiply one matrix against another. In this case, the output is the result of multiplying rows in the first matrix against columns in the second matrix. Here is an example of how you multiply one NumPy matrix against another:

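import numpy as np
# 2 x 4 matrix.
a = np.array([[2, 4, 6, 8],
              [1, 3, 5, 7]])
# 4 x 2 matrix; its row count matches the column count of a.
b = np.array([[1, 2],
              [2, 3],
              [3, 4],
              [4, 5]])
# The product of a 2 x 4 and a 4 x 2 matrix is a 2 x 2 matrix.
c = np.dot(a, b)
print(c)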

In this case, you end up with a 2 x 2 matrix as output. Here are the values you should see when you run the application:

[[60 80]
 [50 66]]

Each row in the first matrix is multiplied by each column of the second matrix. For example, to get the value 50 shown in row 2, column 1 of the output, you match up the values in row two of matrix a with column 1 of matrix b, like this: 1 * 1 + 3 * 2 + 5 * 3 + 7 * 4.
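The shape rule and the single-element calculation can be sketched in Python 3 as follows. This version uses the @ operator (available in Python 3.5 and later), which gives the same result as np.dot() for two-dimensional arrays; the name row2_col1 is an illustrative addition.

```python
import numpy as np

a = np.array([[2, 4, 6, 8],
              [1, 3, 5, 7]])
b = np.array([[1, 2],
              [2, 3],
              [3, 4],
              [4, 5]])

# (2 x 4) times (4 x 2): the inner sizes match, so the result is 2 x 2.
c = a @ b
print(a.shape, b.shape, c.shape)

# Row 2 of a against column 1 of b (zero-based indexes 1 and 0).
row2_col1 = int(np.sum(a[1, :] * b[:, 0]))
print(row2_col1)
```

The value computed by hand matches the entry in row 2, column 1 of c.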