
Aggregating Data at Any Level with Python

By John Paul Mueller, Luca Massaron

You can use Python for aggregating data, a task that comes up constantly in data science. Aggregation is the process of combining or grouping data together into a set, bag, or list. The data may or may not be alike. In most cases, however, an aggregation function combines several rows statistically using an algorithm such as average, count, maximum, median, minimum, mode, or sum. There are several reasons to aggregate data:

  • Make it easier to analyze

  • Reduce the ability of anyone to deduce the data of an individual from the dataset for privacy or other reasons

  • Create a combined data element from one data source that matches a combined data element in another source

The most important use of data aggregation is to promote anonymity in order to meet legal or other concerns. Sometimes even data that should be anonymous can identify an individual when the right analysis techniques are applied. For example, researchers have found that it’s possible to identify individuals based on just three credit card purchases. Here’s an example that shows how to perform aggregation tasks:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Map': [0, 0, 0, 1, 1, 2, 2],
                   'Values': [1, 2, 3, 5, 4, 2, 5]})

# Broadcast each group's statistic back to every row in the group.
df['S'] = df.groupby('Map')['Values'].transform(np.sum)
df['M'] = df.groupby('Map')['Values'].transform(np.mean)
df['V'] = df.groupby('Map')['Values'].transform(np.var)
print(df)

In this case, you have two initial features for this DataFrame. The values in Map define which elements in Values belong together. For example, when calculating a sum for Map index 0, you use the Values 1, 2, and 3.
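You can verify the group membership directly. As a quick sketch (using the same DataFrame as above), get_group() retrieves the rows that share a particular Map value:

```python
import pandas as pd

df = pd.DataFrame({'Map': [0, 0, 0, 1, 1, 2, 2],
                   'Values': [1, 2, 3, 5, 4, 2, 5]})

# get_group(0) returns only the rows where Map equals 0.
group0 = df.groupby('Map').get_group(0)
print(list(group0['Values']))
# [1, 2, 3]
```

These are exactly the three values that feed the sum, mean, and variance for Map index 0.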

To perform the aggregation, you must first call groupby() to group the Map values. You then index into Values and rely on transform() to create the aggregated data using one of several algorithms found in NumPy, such as np.sum. Here are the results of this calculation:

   Map  Values  S    M    V
0    0       1  6  2.0  1.0
1    0       2  6  2.0  1.0
2    0       3  6  2.0  1.0
3    1       5  9  4.5  0.5
4    1       4  9  4.5  0.5
5    2       2  7  3.5  4.5
6    2       5  7  3.5  4.5
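Note that transform() broadcasts each result back onto every row of its group. When you instead want one summary row per group, agg() does the collapsing, and it also covers the other aggregation functions mentioned earlier, such as count, minimum, maximum, and median. Here’s a sketch using the same DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Map': [0, 0, 0, 1, 1, 2, 2],
                   'Values': [1, 2, 3, 5, 4, 2, 5]})

# agg() reduces each group to a single row, one column per statistic.
summary = df.groupby('Map')['Values'].agg(['count', 'min', 'max', 'median'])
print(summary)
```

For Map index 0, for example, count is 3, min is 1, max is 3, and the median is 2.0, all computed from the same Values of 1, 2, and 3 used earlier.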