Algorithms For Dummies
Streaming data needs streaming algorithms, and the key thing to know about streaming algorithms is that, apart from a few measures they can compute exactly, a streaming algorithm necessarily provides approximate results. The algorithm's output is almost correct: not precisely the right answer, but close to it.

When dealing with streams, you have to concentrate only on the measures of interest and leave out many details. You might be interested in a statistical measurement, such as the mean, minimum, or maximum. You might also want to count elements in the stream or distinguish old information from new. There are many algorithms to use, depending on the problem, yet the recipes always use the same ingredients. The trick to cooking the perfect stream is to use one or all of these algorithmic tools as ingredients:

  • Sampling: Reduce your stream to a more manageable data size; represent the entire stream or the most recent observations using a shifting data window.
  • Hashing: Reduce infinite stream variety to a limited set of simple integer numbers.
  • Sketching: Create a short summary of the measure you need, removing the less useful details. This approach lets you leverage a simple working storage, which can be your computer's main memory or its hard disk.
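As a small illustration of the hashing ingredient, the sketch below maps any stream item to one of a fixed number of integer buckets. The function name and bucket count are illustrative, not part of any particular library:

```python
import hashlib

def bucket(item, num_buckets=1024):
    """Map any stream item to one of a fixed set of integer buckets.

    The same item always lands in the same bucket, so an infinite
    variety of inputs reduces to a limited range of simple integers.
    """
    digest = hashlib.md5(str(item).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets
```

Because the mapping is deterministic, `bucket("apple")` always returns the same integer, which makes the bucket usable as an array index for constant-time storage and lookup.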
Another characteristic to keep in mind about algorithms operating on streams is their simplicity and low computational complexity. Data streams can be quite fast. Algorithms that require too many calculations can miss essential data, which means that the data is gone forever. When you view the situation in this light, you can appreciate why hash functions prove useful: they quickly transform inputs into something easier to handle and search, with O(1) complexity for both operations. You can also appreciate the sketching and sampling techniques, which bring about the idea of lossy compression. Lossy compression enables you to represent something complex by using a simpler form. You lose some detail but save a great deal of computer time and storage.
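To make the sketching idea concrete, here is a minimal count-min sketch, a classic streaming summary that estimates item frequencies in fixed memory. The class and its parameters are a simplified sketch for illustration, not a production implementation:

```python
import hashlib

class CountMinSketch:
    """Approximate item counts for a stream using fixed memory.

    Estimates can only overestimate (hash collisions add phantom
    counts), never underestimate the true frequency.
    """

    def __init__(self, width=512, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _positions(self, item):
        # One hashed column per row; the row index salts the hash
        # so the rows act as independent hash functions.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode("utf-8")).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item):
        for row, col in self._positions(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # The true count never exceeds the minimum across the rows.
        return min(self.table[row][col] for row, col in self._positions(item))
```

The memory used is `width * depth` counters no matter how long the stream runs, which is exactly the lossy-compression trade mentioned above: you lose exact counts but gain bounded storage.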

Sampling means drawing a limited set of examples from your stream and treating them as if they represented the entire stream. It is a well-known tool in statistics through which you can make inferences about a larger context (technically called the universe or the population) by using a small part of it.
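One standard way to sample a stream of unknown length is reservoir sampling (Algorithm R), sketched below in Python. The function name is illustrative; the technique itself keeps every item equally likely to appear in the final sample:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream
    whose total length isn't known in advance."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample
```

Notice that the reservoir never grows beyond `k` items, so memory stays constant no matter how many observations flow past.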

About the book authors:

John Paul Mueller has produced 102 books and more than 600 articles to date on topics ranging from networking to machine learning. Luca Massaron is a data scientist specializing in organizing and interpreting big data and transforming it into smart data by means of the simplest and most effective data mining and machine learning techniques.