Data Science: How to Use Python to Upload, Stream, and Sample Data

By John Paul Mueller, Luca Massaron

Storing data in local computer memory represents the fastest and most reliable means to access it with Python. The data could reside anywhere. However, you don’t actually interact with the data in its storage location. You load the data into memory from the storage location and then interact with it in memory.

Data scientists call the columns in a database features or variables. The rows are cases.

Uploading small amounts of data into memory

The most convenient way to work with data is to load it directly into memory. This example uses the Colors.txt file for input.

Format of the Colors.txt file.

The example also relies on native Python functionality to get the task done. When you load a file, the entire dataset is available at all times and the loading process is quite short. Here is an example of how this technique works.

with open("Colors.txt", "r") as open_file:
    # read() with no argument returns the whole file as one string
    print("Colors.txt content:\n" + open_file.read())

The example begins by calling the open() function to obtain a file object. The open() function accepts the filename and an access mode. In this case, the access mode is read (r), which returns the file's content as text. (Under Python 3.x, read binary (rb) mode returns bytes, which you would have to decode before joining them to a string.)

It then uses the read() method of the file object to read all the data in the file. If you were to specify a size argument as part of read(), such as read(15), Python would read only the number of characters that you specify or stop when it reaches the End Of File (EOF). When you run this example, you see the following output:

Colors.txt content:
Color  Value
Red  1
Orange 2
Yellow 3
Green  4
Blue  5
Purple 6
Black  7
White  8

The entire dataset is loaded from the file into free memory. Of course, the loading process will fail if your system lacks sufficient memory to hold the dataset. When this problem occurs, you need to consider other techniques for working with the dataset, such as streaming it or sampling it.
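The size argument to read() described earlier offers one way to avoid holding a whole file in memory: read a fixed number of characters, process them, and repeat until read() returns an empty string at EOF. The following is a minimal sketch of that idea. The read_in_chunks() helper is hypothetical, and an in-memory io.StringIO object stands in for Colors.txt (the tab-separated layout is an assumption) so that the code runs without the file on disk.

```python
import io

def read_in_chunks(open_file, size):
    """Yield successive pieces of at most `size` characters until EOF."""
    while True:
        chunk = open_file.read(size)
        if not chunk:          # read() returns an empty string at EOF
            break
        yield chunk

# Stand-in for the first rows of Colors.txt (assumed tab-separated).
sample_text = "Color\tValue\nRed\t1\nOrange\t2\n"

chunks = list(read_in_chunks(io.StringIO(sample_text), 15))
for chunk in chunks:
    print(repr(chunk))
```

Note that a chunk can end in the middle of a record, so this approach suits tasks such as copying or counting characters better than record-by-record analysis.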

Streaming large amounts of data into memory

Some datasets will be so large that you won’t be able to fit them entirely in memory at one time. In addition, you may find that some datasets load slowly because they reside on a remote site. Streaming answers both needs by making it possible to work with the data a little at a time.

You download individual pieces, making it possible to work with just part of the data and to work with it as you receive it, rather than waiting for the entire dataset to download. Here’s an example of how you can stream data using Python:

with open("Colors.txt", "r") as open_file:
    for observation in open_file:
        # rstrip() drops the trailing newline so print() doesn't double-space
        print("Reading Data: " + observation.rstrip())

This example relies on the Colors.txt file, which contains a header, and then a number of records that associate a color name with a value. The open_file file object contains a pointer to the open file.

As the code performs data reads in the for loop, the file pointer moves to the next record. Each record appears one at a time in observation. The code prints the value in observation. You should receive this output:

Reading Data: Color  Value
Reading Data: Red  1
Reading Data: Orange 2
Reading Data: Yellow 3
Reading Data: Green  4
Reading Data: Blue  5
Reading Data: Purple 6
Reading Data: Black  7
Reading Data: White  8

Python streams each record from the source. This means that you must perform a read for each record you want.
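Because each record arrives on its own, you can compute a result as you go instead of collecting every row first. The sketch below totals the Value column one record at a time. The stream_records() helper is hypothetical, and the code writes a small temporary stand-in file (assumed to be whitespace-separated, like Colors.txt) so that it runs on its own.

```python
import os
import tempfile

def stream_records(path):
    """Yield (color, value) pairs one at a time, skipping the header row."""
    with open(path, "r") as open_file:
        next(open_file, None)              # skip the header line
        for observation in open_file:
            fields = observation.split()   # split on any whitespace
            if len(fields) == 2:
                yield fields[0], int(fields[1])

# Build a small stand-in file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Color\tValue\nRed\t1\nOrange\t2\nYellow\t3\n")
    demo_path = tmp.name

total = 0
for color, value in stream_records(demo_path):
    total += value                         # only the current record is held here

os.remove(demo_path)                       # clean up the stand-in file
print(total)
```

Writing the reader as a generator keeps the file handling in one place while the calling loop sees just one record per iteration.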

Sampling data

Data streaming obtains all the records from a data source. You may find that you don’t need all the records. You can save time and resources by simply sampling the data. Sampling means retrieving records a set number of records apart, such as every fifth record, or choosing records at random. The following code shows how to retrieve every other record in the Colors.txt file:

n = 2
with open("Colors.txt", "r") as open_file:
    for j, observation in enumerate(open_file):
        if j % n == 0:
            # Keep every n-th row, counting from row 0 (the header)
            print("Reading Line: " + str(j) +
                  " Content: " + observation.rstrip())

The basic idea of sampling is the same as streaming. However, in this case, the application uses enumerate() to retrieve a row number. When j % n == 0, the row is one that you want to keep and the application outputs the information. In this case, you see the following output:

Reading Line: 0 Content: Color  Value
Reading Line: 2 Content: Orange 2
Reading Line: 4 Content: Green  4
Reading Line: 6 Content: Purple 6
Reading Line: 8 Content: White  8

The value of n is important in determining which records appear as part of the dataset. Try changing n to 3. The output will change to sample just the header and rows 3 and 6.
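The same every-nth-row selection can also be written with itertools.islice(), which takes a start, stop, and step much like a slice. This is an alternative sketch rather than the approach shown above. The every_nth() helper is hypothetical, and an in-memory io.StringIO object stands in for Colors.txt (assumed tab-separated).

```python
import io
from itertools import islice

def every_nth(lines, n):
    """Return (row_number, text) for every n-th line, starting at row 0."""
    numbered = enumerate(lines)            # pair each line with its row number
    return [(j, line.rstrip("\n"))
            for j, line in islice(numbered, 0, None, n)]

# Stand-in for Colors.txt so the sketch runs without the file on disk.
colors = io.StringIO(
    "Color\tValue\nRed\t1\nOrange\t2\nYellow\t3\nGreen\t4\n"
    "Blue\t5\nPurple\t6\nBlack\t7\nWhite\t8\n"
)

sampled = every_nth(colors, 3)
for j, text in sampled:
    print("Reading Line: " + str(j) + " Content: " + text)
```

With n set to 3, the result holds the header and rows 3 and 6, matching the behavior described above.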

You can perform random sampling as well. All you need to do is randomize the selector, like this:

from random import random
sample_size = 0.25
with open("Colors.txt", "r") as open_file:
    for j, observation in enumerate(open_file):
        if random() <= sample_size:
            # Each row is kept with a probability of about sample_size
            print("Reading Line: " + str(j) +
                  " Content: " + observation.rstrip())

To make this form of selection work, you must import the random() function from the random module. The random() function returns a floating-point value from 0 up to, but not including, 1, and you can’t predict which value you receive. The sample_size variable contains a number between 0 and 1 that sets the probability of keeping each row, so a value of 0.25 keeps about a quarter of the rows on average.

The output will still appear in numeric order. However, the items selected are random, and you won’t always get precisely the same number of return values. The spaces between return values will differ as well. Here is an example of what you might see as output:

Reading Line: 1 Content: Red  1
Reading Line: 4 Content: Green  4
Reading Line: 8 Content: White  8
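Because the selection is random, each run normally returns a different sample. When you need the same sample twice, for instance to recheck a result, one option is to seed the random number generator. The sketch below assumes a hypothetical sample_lines() helper and uses a list of strings as a stand-in for the rows of Colors.txt (assumed tab-separated); the seed value 42 is arbitrary.

```python
import random

def sample_lines(lines, sample_size, seed=None):
    """Keep each line with probability sample_size; a seed makes runs repeatable."""
    rng = random.Random(seed)              # private generator, separate from the global one
    return [(j, line.rstrip("\n"))
            for j, line in enumerate(lines)
            if rng.random() <= sample_size]

# Stand-in for the rows of Colors.txt.
rows = ("Color\tValue\nRed\t1\nOrange\t2\nYellow\t3\nGreen\t4\n"
        "Blue\t5\nPurple\t6\nBlack\t7\nWhite\t8\n").splitlines(keepends=True)

first_run = sample_lines(rows, 0.25, seed=42)
second_run = sample_lines(rows, 0.25, seed=42)
print(first_run == second_run)             # same seed, same sample

for j, text in first_run:
    print("Reading Line: " + str(j) + " Content: " + text)
```

Leaving seed as None restores the unpredictable behavior of the original example.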