Algorithms For Dummies
Book image
Explore Book Buy On Amazon
Learning to count objects in a stream can help you find the most frequent items or rank usual and unusual events. This algorithm leverages hash functions and approximate sketches. It does so after filtering duplicated objects and counting distinct elements that have appeared in the data stream.

You use this technique to solve problems like finding the most frequent queries in a search engine, the bestselling items from an online retailer, the highly popular pages in a website, or the most volatile stocks (by counting the times a stock is sold and bought).

You apply the solution to this problem, Count-Min Sketch, to a data stream. It requires just one data pass and stores as little information as possible. This algorithm is applied in many real-world situations (such as analyzing network traffic or managing distributed data flows). The recipe requires using a bunch of hash functions, each one associated with a bit vector, in a way that resembles a Bloom filter, as shown in the figure:

  1. Initialize all the bit vectors to zeroes in all positions.
  2. Apply the hash function for each bit vector when receiving an object from a stream. Use the resulting numeric address to increment the value at that position.
  3. Apply the hash function to an object and retrieve the value at the associated position when asked to estimate the frequency of an object. Of all the values received from the bit vectors, you take the smallest as the frequency of the stream.
How values are updated in a Count-Min Sketch.

Because collisions are always possible when using a hash function, especially if the associated bit vector has few slots, having multiple bit vectors at hand assures you that at least one of them keeps the correct value. The value of choice should be the smallest because it isn't mixed with false positive counts due to collisions.

About This Article

This article is from the book:

About the book authors:

John Paul Mueller has produced 102 books and more than 600 articles to date on topics ranging from networking to machine learning. Luca Massaron is a data scientist specializing in organizing and interpreting big data and transforming it into smart data by means of the simplest and most effective data mining and machine learning techniques.

This article can be found in the category: