Data Science: Filtering and Selecting Data with Python
Python is a useful tool for data science. You may not need to work with all the data in a dataset. In fact, looking at just one particular column might be beneficial, such as age, or a set of rows with a significant amount of information. You perform two steps to obtain just the data you need to perform a particular task:
Filter rows to create a subject of the data that meets the criterion you select (such as all the people between the ages of 5 and 10).
Select data columns that contain the data you need to analyze. For example, you probably don’t need the individuals’ names unless you want to perform some analysis based on name.
The act of slicing and dicing data, gives you a subset of the data suitable for analysis. Here are various ways to obtain specific pieces of data to meet particular needs.
Slicing can occur in multiple ways when working with data, but the technique of interest here is to slice data from a row of 2D or 3D data. A 2D array may contain temperatures (x axis) over a specific timeframe (y axis). Slicing a row would mean seeing the temperatures at a specific time. In some cases, you might associate rows with cases in a dataset.
A 3D array might include an axis for place (x axis), product (y axis), and time (z axis) so that you can see sales for items over time. Perhaps you want to track whether sales of an item are increasing, and specifically where they are increasing. Slicing a row would mean seeing all the sales for one specific product for all locations at any time. The following example demonstrates how to perform this task:
x = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9],], [[11,12,13], [14,15,16], [17,18,19],], [[21,22,23], [24,25,26], [27,28,29]]]) x
In this case, the example builds a 3D array. It then slices row 1 of that array to produce the following output:
array([[11, 12, 13], [14, 15, 16], [17, 18, 19]])
Using the examples from above, slicing columns would obtain data at a 90-degree angle from rows. In other words, when working with the 2D array, you would want to see the times at which specific temperatures occurred. Likewise, you might want to see the sales of all products for a specific location at any time when working with the 3D array. In some cases, you might associate columns with features in a dataset. The following example demonstrates how to perform this task using the same array:
x = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9],], [[11,12,13], [14,15,16], [17,18,19],], [[21,22,23], [24,25,26], [27,28,29]]]) x[:,1]
Notice that the indexing now occurs at two levels. The first index refers to the row. Using the colon (:) for the row means to use all the rows. The second index refers to a column. In this case, the output will contain column 1. Here’s the output you see:
array([[ 4, 5, 6], [14, 15, 16], [24, 25, 26]])
This is a 3D array. Therefore, each of the columns contains all the z axis elements. What you see is every row — 0 through 2 for column 1 with every z axis element 0 through 2 for that column.
The act of dicing a dataset means to perform both row and column slicing such that you end up with a data wedge. For example, when working with the 3D array, you might want to see the sales of a specific product in a specific location at any time. The following example demonstrates how to perform this task using the same array as above:
x = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9],], [[11,12,13], [14,15,16], [17,18,19],], [[21,22,23], [24,25,26], [27,28,29]]]) print x[1,1] print x[:,1,1] print x[1,:,1] print print x[1:2, 1:2]
This example dices the array in four different ways. First, you get row 1, column 1. Of course, what you may actually want is column 1, z axis 1. If that’s not quite right, you could always request row 1, z axis 1 instead. Then again, you may want rows 1 and 2 of columns 1 and 2. Here’s the output of all four requests:
[14 15 16] [ 5 15 25] [12 15 18] [[[14 15 16] [17 18 19]] [[24 25 26] [27 28 29]]]