One of the machine learning applications involving images that affects nearly everyone today is computer vision: the technique of identifying the individual objects within a frame captured by a camera or some other source.

When you look at an image, you see objects: perhaps individual people, stoplights, cars, and other items. Whatever the image contains, you see the objects and understand what they are. A computer, however, sees pixels: a 2D grid of numeric values that translate into colors when presented on a screen. For a computer to see the objects that you see, it requires some sort of deep learning technology, such as a Convolutional Neural Network (CNN).
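To get a feel for what "seeing pixels" means, here is a minimal Python sketch (assuming the Pillow and NumPy libraries are installed; photo.jpg is just a placeholder file name) that loads an image and shows that, to the computer, it is nothing but a grid of numbers:

    from PIL import Image
    import numpy as np

    # Open an image file (photo.jpg is a placeholder name) and convert it to
    # a NumPy array: a grid of height x width x color-channel numeric values.
    image = Image.open("photo.jpg")
    pixels = np.array(image)

    print(pixels.shape)   # for example, (480, 640, 3): rows, columns, RGB channels
    print(pixels[0, 0])   # the numeric RGB values of the top-left pixel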

The concept of computer vision started in 1966 (yes, that long ago) when Seymour Papert and Marvin Minsky launched the Summer Vision Project, a two-month, ten-person effort to create a computer system using symbolic AI that could identify objects in images. To accomplish this task, the computer would have to move from working with pixels to identifying which pixels belonged to a particular object. Given the technology of the time, the Summer Vision Project didn’t get far.

The next attempt came in 1979 from a Japanese scientist, Kunihiko Fukushima, who proposed the Neocognitron. This project was based on neuroscience research performed on humans, and it attempted to perform its task in a human-like manner. The Neocognitron was successful in a very basic way, but it, too, failed at tasks of any complexity.

The first success came in the 1980s with the efforts of French computer scientist Yann LeCun, who built the CNN, an architecture inspired by the Neocognitron. CNNs are the building blocks of deep learning–based image recognition, yet they answer only a basic classification need: Given a picture, they can determine whether its content can be associated with a specific image class learned through previous examples. Therefore, when you train a deep neural network to recognize dogs and cats, you can feed it a photo and obtain output that tells you whether the photo contains a dog or a cat. The outputs generally come in two forms:

  • If the last network layer is a softmax layer, the network outputs the probabilities of the photo containing a dog or a cat (the two classes you trained it to recognize), and those probabilities sum to 100 percent.
  • When the last layer is a sigmoid-activated layer, you obtain scores that you can interpret as probabilities of the content belonging to each class independently, and those scores won't necessarily sum to 100 percent.
The code sketch after the next list shows how the two output styles differ. In both cases, the classification may fail when the following occurs:
  • The main object isn’t what you trained the network to recognize. You may have presented the example neural network with a photo of a raccoon. In this case, the network will output an incorrect answer of dog or cat.
  • The main object is partially obstructed. For instance, your cat is playing hide-and-seek in the photo you show the network, and the network can’t spot it.
  • The photo contains many different objects to detect, perhaps including animals other than cats and dogs. In this case, the output from the network will suggest a single class rather than include all the objects.
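To make the difference between the two output styles concrete, here is a minimal sketch using the Keras API from TensorFlow (assuming TensorFlow is installed; the tiny network body and the 128 x 128 input size are illustrative choices, not a trained dog-and-cat model). The two styles differ only in the activation of the final layer:

    import tensorflow as tf
    from tensorflow.keras import layers, models

    # A tiny placeholder CNN body; a real dog/cat classifier would be much
    # deeper and would be trained on labeled photos.
    def build_classifier(final_activation):
        return models.Sequential([
            layers.Input(shape=(128, 128, 3)),
            layers.Conv2D(16, 3, activation="relu"),
            layers.MaxPooling2D(),
            layers.Flatten(),
            # Two units, one per class (dog, cat). With "softmax", the two
            # outputs are probabilities that sum to 1; with "sigmoid", each
            # output is an independent score between 0 and 1.
            layers.Dense(2, activation=final_activation),
        ])

    softmax_net = build_classifier("softmax")   # mutually exclusive classes
    sigmoid_net = build_classifier("sigmoid")   # independent per-class scores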
The following figure shows image 47780 taken from the MS COCO dataset (released as part of the open source Creative Commons Attribution 4.0 License). The series of three outputs shows how a CNN has detected, localized, and segmented the objects appearing in the image (a kitten and a dog standing on a field of grass). A plain CNN can't reproduce these results because its architecture outputs a single class for the entire image. To overcome this limitation, researchers extend a basic CNN's capabilities to make it capable of the following (a short sketch after this list shows one way to run all three tasks with a pretrained model):

Figure: Detection, localization, and segmentation example from the MS COCO dataset.
  • Detection: Determining when an object is present in an image. Detection is different from classification because it involves just a portion of the image, implying that the network can detect multiple objects of the same and of different types. The capability to spot objects in partial images is called instance spotting.
  • Localization: Defining exactly where a detected object appears in an image. Localizations come in different granularities; depending on that granularity, they distinguish the part of the image that contains the detected object, most commonly as a bounding box around it.
  • Segmentation: Classification of objects at the pixel level. Segmentation takes localization to the extreme: this kind of neural model assigns each pixel of the image to a class, or even to a specific entity. For instance, the network marks all the pixels in a picture that belong to dogs and distinguishes each individual dog using a different label (a task called instance segmentation).
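As a minimal sketch of running all three tasks with an extended CNN, the following uses a pretrained Mask R-CNN from PyTorch's torchvision library (one readily available option, not the book's own example code; it assumes PyTorch and torchvision 0.13 or later are installed, and dogs_and_cats.jpg is a placeholder for any local photo):

    import torch
    from PIL import Image
    from torchvision.models.detection import maskrcnn_resnet50_fpn
    from torchvision.transforms.functional import to_tensor

    # A pretrained Mask R-CNN extends a CNN backbone with heads for detection,
    # localization (bounding boxes), and segmentation (per-pixel masks).
    model = maskrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()

    # dogs_and_cats.jpg is a placeholder for any local photo.
    image = to_tensor(Image.open("dogs_and_cats.jpg").convert("RGB"))

    with torch.no_grad():
        prediction = model([image])[0]

    print(prediction["labels"])        # detection: a COCO category id for each object found
    print(prediction["scores"])        # confidence score for each detection
    print(prediction["boxes"])         # localization: one bounding box per object
    print(prediction["masks"].shape)   # segmentation: one pixel-level mask per object

Each returned mask covers a single object instance, which is the instance segmentation described in the list above.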
Of the uses for CNNs, facial recognition is perhaps the most well known. However, new technologies such as self-driving cars rely extensively on CNNs. In addition, you see CNNs used in areas like content moderation, in which an organization must remove unwanted uploads from a website. Perhaps the most interesting uses for computer vision, though, are in helping people perform tasks better. For example, by using computer vision to monitor how a person moves their limbs, it becomes possible to provide a better level of physical therapy to patients.

Computer vision is an essential part of the future of deep learning. You can find examples of how to implement basic computer vision in Deep Learning For Dummies, by John Paul Mueller and Luca Massaron (Wiley).
