Data Science: How to Send Data in Unstructured File Form with Python

By John Paul Mueller, Luca Massaron

You can use Python to send data in unstructured file form. Unstructured data files consist of a series of bits. The file doesn’t separate the bits from each other in any way. You can’t simply look into the file and see any structure because there isn’t any to see. Unstructured file formats rely on the file user to know how to interpret the data.

For example, each pixel of a picture file could consist of three 32-bit fields. Knowing that each field is 32 bits wide is up to you. A header at the beginning of the file may provide clues about interpreting the file, but even so, it’s up to you to know how to interact with the file.
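To make that point concrete, here is a minimal sketch using NumPy. The byte values and both layouts are invented for illustration; nothing in the bytes themselves says which reading is the right one.

```python
import numpy as np

# Twelve raw bytes with no embedded structure (values are made up).
raw = bytes(range(12))

# Reading 1: four pixels, each with three 8-bit channels.
as_rgb_pixels = np.frombuffer(raw, dtype=np.uint8).reshape(4, 3)

# Reading 2: one pixel made of three 32-bit little-endian fields.
as_32bit_fields = np.frombuffer(raw, dtype="<u4")

print(as_rgb_pixels.shape)    # (4, 3)
print(as_32bit_fields.shape)  # (3,)
```

The same twelve bytes yield either four small pixels or one large one, which is why you need to know the layout before you read the file.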

The example here shows how to work with a picture as an unstructured file. The example image is a public domain offering. To work with images, you need to access the Scikit-image library, which is a free-of-charge collection of algorithms used for image processing. A tutorial for this library is available if you need help.

The first task is to be able to display the image onscreen using the following code. (This code can require a little time to run. The image is ready when the busy indicator disappears from the IPython Notebook tab.)

from skimage.io import imread
from skimage.transform import resize
from matplotlib import pyplot as plt
import matplotlib.cm as cm

# Build the URL of the example image in two pieces to keep lines short.
example_file = ("http://upload.wikimedia.org/" +
                "wikipedia/commons/7/7d/Dog_face.png")
# Load the image and convert it to grayscale.
image = imread(example_file, as_gray=True)
plt.imshow(image, cmap=cm.gray)
plt.show()

The code begins by importing a number of libraries. It then creates a string that points to the example file online and places it in example_file. This string is passed to the imread() function, along with as_gray, which is set to True. (Older scikit-image releases spell this argument as_grey.) The as_gray argument tells imread() to convert any color image to grayscale. Images that are already grayscale stay that way.
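If you are curious what a grayscale conversion involves, the following is a rough sketch in plain NumPy. The tiny 2 x 2 image and the luminance weights are assumptions chosen for illustration; treat this as one common way to collapse color channels, not as a description of what imread() does internally.

```python
import numpy as np

# A hypothetical 2 x 2 RGB image with channel values in [0, 1].
rgb = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]])

# Luminance weights for red, green, and blue (assumed for this sketch).
weights = np.array([0.2125, 0.7154, 0.0721])

# A weighted sum over the last (channel) axis leaves one value per pixel.
gray = rgb @ weights

print(gray.shape)  # (2, 2)
```

The three-dimensional color array becomes a two-dimensional array, which is the form the rest of this example works with.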

Now that you have an image loaded, it’s time to render it (make it ready to display onscreen). The imshow() function performs the rendering and uses a grayscale color map. The show() function actually displays the image for you.

The image appears onscreen after you render and show it.

Close the image when you’re finished viewing it. (The asterisk in the In [*]: entry tells you that the code is still running and you can’t move on to the next step.) The act of closing the image ends the code segment. You now have an image in memory and you may want to find out more about it. When you run the following code, you discover the image type and size:

print("data type: %s, shape: %s" %
      (type(image), image.shape))

The output from this call tells you that the image type is a numpy.ndarray and that the image size is 90 pixels by 90 pixels. The image is actually an array of pixels that you can manipulate in various ways. For example, if you want to crop the image, you can use the following code to manipulate the image array:

image2 = image[5:70,0:70]
plt.imshow(image2, cmap=cm.gray)
plt.show()

The numpy.ndarray in image2 is smaller than the one in image, so the output is smaller as well. The purpose of cropping the image is to make it a specific size. Images must be the same size before you can analyze them together, and cropping is one way to ensure that they are.

Cropping the image makes it smaller.
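The slicing behind the crop can be checked without downloading anything. This sketch assumes a stand-in 90 x 90 array filled with arbitrary values; the names cropped and independent are hypothetical.

```python
import numpy as np

# Stand-in for the 90 x 90 grayscale picture (values are arbitrary).
image = np.arange(90 * 90, dtype=float).reshape(90, 90)

# Rows 5 through 69 and columns 0 through 69; the stop index is excluded.
cropped = image[5:70, 0:70]

print(image.shape)    # (90, 90)
print(cropped.shape)  # (65, 70)

# Basic slicing returns a view of the original data, so make a copy
# if you plan to modify the crop independently.
independent = cropped.copy()
```

Note that this particular slice produces 65 rows and 70 columns, so the cropped result is not square.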

Another method that you can use to change the image size is to resize it. The following code resizes the image to a specific size for analysis:

# mode='edge' repeats border pixels when sampling past the image edge.
image3 = resize(image2, (30, 30), mode='edge')
plt.imshow(image3, cmap=cm.gray)
print("data type: %s, shape: %s" %
      (type(image3), image3.shape))
plt.show()

The output from the print() function tells you that the image is now 30 pixels by 30 pixels in size. You can compare it to any image with the same dimensions.
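To get a feel for how a 65 x 70 array can become 30 x 30, here is a deliberately simple stand-in written in plain NumPy. The helper sample_resize is hypothetical and only picks one source pixel per output pixel; scikit-image’s resize() interpolates between pixels, so its output values will generally differ.

```python
import numpy as np

def sample_resize(image, out_shape):
    """Hypothetical helper: pick one source pixel for each output pixel."""
    rows = (np.arange(out_shape[0]) * image.shape[0]) // out_shape[0]
    cols = (np.arange(out_shape[1]) * image.shape[1]) // out_shape[1]
    # np.ix_ builds the row/column index grid for the selection.
    return image[np.ix_(rows, cols)]

# Stand-in for the cropped picture (values are arbitrary).
cropped = np.arange(65 * 70, dtype=float).reshape(65, 70)
small = sample_resize(cropped, (30, 30))

print(small.shape)  # (30, 30)
```

Whatever method you use, the goal is the same: every image ends up with identical dimensions.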

After you have all the images the right size, you need to flatten them. A dataset row is always a single dimension, not two dimensions. The image is currently an array of 30 pixels by 30 pixels, so you can’t make it part of a dataset. The following code flattens image3 so that it becomes an array of 900 elements that is stored in image_row.

image_row = image3.flatten()
print("data type: %s, shape: %s" %
      (type(image_row), image_row.shape))

Notice that the type is still a numpy.ndarray. You can add this array to a dataset and then use the dataset for analysis purposes. The size is 900 elements, as anticipated.
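As a possible next step, you might stack several flattened images into a two-dimensional dataset with one row per image. The sketch below assumes three random stand-in images; the names images, dataset, and restored are hypothetical.

```python
import numpy as np

# Three stand-in 30 x 30 images (random values, for illustration only).
rng = np.random.default_rng(0)
images = [rng.random((30, 30)) for _ in range(3)]

# Flatten each image into a 900-element row, then stack the rows.
dataset = np.vstack([img.flatten() for img in images])
print(dataset.shape)  # (3, 900)

# Reshaping a row recovers the original two-dimensional layout.
restored = dataset[0].reshape(30, 30)
```

Because flatten() preserves the element order, reshaping a row back to 30 by 30 gives you the original image again.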