Image Classification with Hadoop

By Dirk deRoos

Image classification requires a significant amount of data processing resources, which has historically limited the scale of deployments. Image classification is a hot topic in the Hadoop world because, until Hadoop came along, no mainstream technology could open the door to this kind of expensive processing at such a massive and efficient scale.

Image classification starts with the notion that you build a training set and that computers learn to identify and classify what they’re looking at. In the same way that having more data helps build better fraud detection and risk models, it also helps systems to better classify images.

In this use case, the labeled data is referred to as the training set, and the models are called classifiers. Classifiers recognize features or patterns within sound, image, or video data and classify them appropriately. Classifiers are built and iteratively refined from training sets so that their precision scores (a measure of exactness) and recall scores (a measure of coverage) are high.
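The precision and recall scores mentioned above have simple definitions. Here is a minimal sketch; the image IDs and labels are invented purely for illustration:

```python
def precision_recall(predicted, actual):
    """Compute precision (exactness) and recall (coverage) for one class.

    precision = true positives / everything the classifier tagged
    recall    = true positives / everything that truly belongs to the class
    """
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical example: images a classifier tagged as "winter sport"
# versus the images that actually show a winter sport.
predicted = {"img1", "img2", "img3", "img4"}
actual = {"img1", "img2", "img5"}

p, r = precision_recall(predicted, actual)
print(p, r)  # 0.5 precision (2 of 4 tags correct), ~0.667 recall (2 of 3 found)
```

Refining a classifier against its training set means pushing both of these numbers up: a high-precision, low-recall classifier is too conservative, while the reverse is too trigger-happy.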

Hadoop is well suited for image classification because it provides a massively parallel processing environment not only to create classifier models (by iterating over training sets) but also to provide nearly limitless scalability for running those classifiers across massive volumes of unstructured data.

Consider multimedia sources such as YouTube, Facebook, Instagram, and Flickr — all are sources of unstructured binary data. The figure shows one way you can use Hadoop to scale the processing of large volumes of stored images and video for multimedia semantic classification.


You can see how all the concepts relating to the Hadoop processing framework are applied to this data. Notice how images are loaded into HDFS. The classifier models, built over time, are applied to the extracted image features in the Map phase of this solution. As you can see in the lower-right corner, the output of this processing consists of image classifications that range from cartoons to sports and locations, among others.
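To make the Map phase concrete, here is a minimal Hadoop Streaming-style sketch. The feature extraction and the classifier are hypothetical stand-ins (a real deployment would use the models built from the training sets); the point is the shape of the mapper, which turns one image record into one image-ID/classification pair that Hadoop can then shuffle, reduce, and store at scale:

```python
def extract_features(image_bytes):
    # Stand-in for real feature extraction (edges, color histograms, and so on);
    # here the raw byte length serves as a toy one-element feature vector.
    return [len(image_bytes)]

def classify(features):
    # Stand-in for a trained classifier model; a real one would score the
    # feature vector against every class and return the best-matching label.
    return "sports" if features[0] > 100 else "cartoons"

def map_record(line):
    """Map phase: one tab-separated record (image_id, image_data) in,
    one tab-separated (image_id, classification) key/value pair out."""
    image_id, image_data = line.rstrip("\n").split("\t", 1)
    label = classify(extract_features(image_data.encode("utf-8")))
    return f"{image_id}\t{label}"

# In a real Hadoop Streaming job this loop would read records from sys.stdin
# and write to sys.stdout; here we feed it two toy records directly.
records = ["img001\t" + "x" * 200, "img002\t" + "x" * 10]
output = [map_record(r) for r in records]
print(output)
```

Because each record is classified independently, Hadoop can run thousands of these mappers in parallel across the cluster, which is exactly what makes classification over massive image collections tractable.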

Hadoop can be used for audio or voice analytics, too. One security industry client we work with created an audio classification system to classify sounds heard via acoustic-enriched fiber optic cables laid around the perimeter of nuclear reactors.

For example, this system knows how to nearly instantaneously classify the whisper of the wind as compared to the whisper of a human voice or to distinguish the sound of human footsteps running in the perimeter parklands from that of wildlife.

This description may have sort of a Star Trek feel to it, but you can now see live examples. In fact, IBM has made public one of the largest image-classification systems in the world: the IBM Multimedia Analysis and Retrieval System (IMARS).

Here are the results of an IMARS search for the term alpine skiing. At the top of the figure, you can see the results of the classifiers mapped to the image set that was processed by Hadoop, along with an associated tag cloud.


Note the multiple classification tiers: the more granular classifiers roll up into more coarsely defined parent classifiers, all generated automatically by the classifier model, built and scored using Hadoop.

None of these pictures has any added metadata. No one has opened iPhoto and tagged an image as a winter sport to make it show up in this classification. It’s the winter sport classifier that was built to recognize image attributes and characteristics of sports that are played in a winter setting.

Image classification has many applications, and being able to perform this classification at massive scale using Hadoop opens up more possibilities for analysis, because other applications can use the classification information generated for the images.

Look at this example from the health industry. A large health agency in Asia was focused on delivering health care via mobile clinics to a rural population distributed across a large land mass. A significant problem that the agency faced was the logistical challenge of analyzing the medical imaging data that was generated in its mobile clinics.

A radiologist is a scarce resource in this part of the world, so it made sense to electronically transmit the medical images to a central point and have an army of doctors examine them. The doctors examining the images were quickly overloaded, however.

The agency is now working on a classification system to help identify possible conditions and provide suggestions for the doctors to verify. Early testing has shown that this strategy helps reduce the number of missed or inaccurate diagnoses, saving time, money, and, most of all, lives.