Analyzing Data Streams with the Right Recipe

By John Paul Mueller, Luca Massaron

Streaming data needs streaming algorithms, and the key thing to know about streaming algorithms is that, apart a few measures that it can compute exactly, a streaming algorithm necessarily provides approximate results. The algorithm output is almost correct, guessing not the precisely right answer, but close to it.

When dealing with streams, you clearly have to concentrate only on the measures of interest and leave out many details. You could be interested in a statistical measurement, such as mean, minimum, or maximum. Moreover, you could want to count elements in the stream or distinguish old information from new. There are many algorithms to use, depending on the problem, yet the recipes always use the same ingredients. The trick of cooking the perfect stream is to use one or all of these algorithmic tools as ingredients:

  • Sampling: Reduce your stream to a more manageable data size; represent the entire stream or the most recent observations using a shifting data window.
  • Hashing: Reduce infinite stream variety to a limited set of simple integer numbers.
  • Sketching: Create a short summary of the measure you need, removing the less useful details. This approach lets you leverage a simple working storage, which can be your computer’s main memory or its hard disk.

Another characteristic to keep in mind about algorithms operating on streams is their simplicity and low computational complexity. Data streams can be quite fast. Algorithms that require too many calculations can miss essential data, which means that the data is gone forever. When you view the situation in this light, you can appreciate how hash functions prove useful because they’re prompt in transforming inputs into something easier to handle and search because for both operations, complexity is O(1). You can also appreciate the sketching and sampling techniques, which bring about the idea of lossy compression. Lossy compression enables you to represent something complex by using a simpler form. You lose some detail but save a great deal of computer time and storage.

Sampling means drawing a limited set of examples from your stream and treating them as if they represented the entire stream. It is a well-known tool in statistics through which you can make inferences on a larger context (technically called the universe or the population) by using a small part of it.