Manipulating Dataset Entries for Functional Programming - dummies

By John Paul Mueller

You’re unlikely to find a common dataset used with Python that doesn’t come with reasonably good documentation. For the full story about how a dataset is put together, what purpose it serves, and who originated it, as well as any statistics you need to suit your functional programming goals, you need to find that documentation online. Fortunately, you can employ a few tricks to interact with a dataset without resorting to major online research.

Determining the dataset content for functional programming

Once you load or fetch existing datasets from specific sources, you can apply them to your functional programming goals. These datasets generally have specific characteristics that you can discover online, for example on the scikit-learn documentation page for the Boston house-prices dataset. However, you can also use the dir() function to learn about dataset content. When you call dir(Boston) on the previously created Boston house-prices dataset, you discover that it contains DESCR, data, feature_names, and target properties. Here is a short description of each property:

  • DESCR: Text that describes the dataset content and some of the information you need to use it effectively
  • data: The content of the dataset in the form of values used for analysis purposes
  • feature_names: The names of the various attributes in the order in which they appear in data
  • target: An array of values used with data to perform various kinds of analysis

Calling print(Boston.DESCR) displays a wealth of information about the Boston house-prices dataset, including the names of attributes that you can use to interact with the data. The following figure shows the results of these queries.

Most common datasets are configured to tell you about themselves.
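The discovery steps described above can be sketched without loading a real dataset. The TinyDataset class and its values below are hypothetical stand-ins, made up for illustration; the sketch only assumes that a dataset object exposes its entries as attributes and reports them through dir(), which is how the Boston object behaves in the text.

```python
# Hypothetical stand-in for a dataset object; not the real scikit-learn class.
class TinyDataset(dict):
    """Dictionary whose keys can also be read as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __dir__(self):
        # Report only the dataset keys so that dir() lists the properties.
        return list(self.keys())


# Made-up values for illustration only.
sample = TinyDataset(
    DESCR="Toy dataset: two features, three records.",
    data=[[0.01, 6.5], [0.03, 5.9], [0.02, 7.1]],
    feature_names=["CRIM", "RM"],
    target=[24.0, 21.6, 34.7],
)

# dir() returns the property names in sorted order.
properties = dir(sample)
print(properties)

# DESCR holds the human-readable description.
print(sample.DESCR)

# feature_names lines up with the columns of each record in data.
first_record = dict(zip(sample.feature_names, sample.data[0]))
print(first_record)
```

Pairing feature_names with a record, as the last step does, is a quick way to confirm which value belongs to which attribute before you start any analysis.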

The properties that datasets provide have a great deal in common. For example, if you use dir(data) for the Olivetti faces dataset example described earlier, you find that it provides access to DESCR, data, images, and target properties. As with the Boston house-prices dataset, DESCR gives you a description of the Olivetti faces dataset, which you can use for things like accessing particular attributes; in this case, you’d use print(data.DESCR) to obtain it. By knowing the names of common properties and understanding how to use them, you can discover most of what you need to know about a common dataset without resorting to any online resource. Also, some of the description data contains links to sites where you can learn more.

Using the dataset sample code for functional programming

The online sources are important because they provide you with access to sample code, in addition to information about the dataset. For example, the Boston house-prices site provides access to six examples, one of which is the Gradient Boosting Regression example. Discovering how others access these datasets can help you build your own code. Of course, the dataset doesn’t limit you to the uses shown by these examples; the data is available for any use you might have for it.

Creating a DataFrame

The common datasets are in a form that allows various types of analysis, as shown by the examples provided on the sites that describe them. However, you might not want to work with the dataset in that manner; instead, you may want something that looks a bit more like a database table. Fortunately, you can use the pandas library to perform the conversion, which makes it easy to use the datasets in other ways. Using the Boston house-prices dataset as an example, the following code performs the required conversion:

import pandas as pd
BostonTable = pd.DataFrame(Boston.data,
                           columns=Boston.feature_names)

If you want to include the target values in the DataFrame, you must also execute: BostonTable['target'] = Boston.target. However, this example doesn’t use the target data.
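Here is a minimal sketch of the same conversion using a few made-up values, so that it runs without the Boston object. The names data, feature_names, target, toy_table, and with_target are hypothetical. The sketch uses the pandas assign() method rather than item assignment to add the target column; that is a substitution made here, chosen because it returns a new DataFrame instead of changing the existing one, which suits a functional style.

```python
import pandas as pd

# Hypothetical values standing in for Boston.data, Boston.feature_names,
# and Boston.target; they are not taken from the real dataset.
data = [[0.01, 6.5, 65.2], [0.03, 5.9, 78.9], [0.02, 7.1, 45.8]]
feature_names = ["CRIM", "RM", "AGE"]
target = [24.0, 21.6, 34.7]

# Build the table from the raw values, one column per feature name.
toy_table = pd.DataFrame(data, columns=feature_names)

# assign() returns a new DataFrame and leaves toy_table unchanged.
with_target = toy_table.assign(target=target)

print(toy_table.columns.tolist())
print(with_target.columns.tolist())
```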

Accessing specific records for functional programming

If you call dir() on a DataFrame, you find that it provides an overwhelming number of functions to try. The pandas documentation supplies a good overview of what’s possible (which includes all the usual database tasks summarized as CRUD: create, read, update, and delete). The following example code shows how to perform a query against a pandas DataFrame. In this case, the code selects only those housing areas where the per capita crime rate is below 0.02.

CRIMTable = BostonTable.query('CRIM < 0.02')
print(CRIMTable.count()['CRIM'])

The output shows that only 17 records match the criteria. The count() function counts the non-missing values in each column of the resulting CRIMTable. The index, ['CRIM'], selects the count for just one of the available attributes (because every column is likely to have the same count).
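The difference between counting values and counting records is easier to see in a small sketch. The table below is hypothetical and deliberately includes one missing RM value, which the Boston example isn’t assumed to have; it shows that count() reports a separate number for each column, whereas len() reports the number of rows.

```python
import pandas as pd

# Hypothetical table with one missing RM value (None becomes NaN).
toy_table = pd.DataFrame({
    "CRIM": [0.01, 0.03, 0.015, 0.5],
    "RM": [6.5, 5.9, None, 6.0],
})

# Keep only the rows where the crime rate is below 0.02.
low_crime = toy_table.query("CRIM < 0.02")

# count() tallies the non-missing values in each column separately.
per_column = low_crime.count()

# len() gives the number of matching rows regardless of missing values.
row_total = len(low_crime)

print(per_column)
print(row_total)
```

When a table might contain missing values, len() is the safer way to count matching records, because the per-column counts can disagree.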

You can display all these records with all of the attributes, but you may want to see only the number of rooms and the average house age for the affected areas. The following code shows how to display just the attributes you actually need:

print(CRIMTable[['RM', 'AGE']])

The image below shows the output from this code. As you can see, the houses vary between 5 and nearly 8 rooms in size. The age varies from almost 14 years to a little over 65 years.

Manipulating the data helps you find specific information.

You might find it a bit hard to work with the unsorted data you see above. Fortunately, you do have access to the full range of common database features. If you want to sort the values by number of rooms, you use:

print(CRIMTable[['RM', 'AGE']].sort_values('RM'))

As an alternative, you can always choose to sort by average home age:

print(CRIMTable[['RM', 'AGE']].sort_values('AGE'))
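As a brief sketch of two other sorting options, the following code uses a hypothetical table with made-up values. It relies on the ascending parameter of sort_values() to reverse the order, and on a list of column names to break ties. In both cases sort_values() returns a new DataFrame, so the original table keeps its order.

```python
import pandas as pd

# Hypothetical values; two rows share the same number of rooms.
toy_table = pd.DataFrame({
    "RM": [6.5, 7.9, 5.1, 6.5],
    "AGE": [65.2, 13.9, 45.8, 21.4],
})

# Largest number of rooms first.
by_rooms_desc = toy_table.sort_values("RM", ascending=False)

# Sort by rooms, then use age to order rows with equal room counts.
by_rooms_then_age = toy_table.sort_values(["RM", "AGE"])

print(by_rooms_desc)
print(by_rooms_then_age)
```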