Counting Objects in a Data Stream

Learning to count objects in a stream can help you find the most frequent items or rank usual and unusual events. This algorithm leverages hash functions and approximate sketches. It does so after filtering duplicated objects and counting distinct elements that have appeared in the data stream.

You use this technique to solve problems like finding the most frequent queries in a search engine, the bestselling items from an online retailer, the highly popular pages in a website, or the most volatile stocks (by counting the times a stock is sold and bought).

You apply the solution to this problem, Count-Min Sketch, to a data stream. It requires just one data pass and stores as little information as possible. This algorithm is applied in many real-world situations (such as analyzing network traffic or managing distributed data flows). The recipe requires using a bunch of hash functions, each one associated with a bit vector, in a way that resembles a Bloom filter, as shown in the figure:

Initialize all the bit vectors to zeroes in all positions.
Apply the hash function for each bit vector when receiving an object from a stream. Use the resulting numeric address to increment the value at that position.
Apply the hash function to an object and retrieve the value at the associated position when asked to estimate the frequency of an object. Of all the values received from the bit vectors, you take the smallest as the frequency of the stream.

How values are updated in a Count-Min Sketch.

Because collisions are always possible when using a hash function, especially if the associated bit vector has few slots, having multiple bit vectors at hand assures you that at least one of them keeps the correct value. The value of choice should be the smallest because it isn't mixed with false positive counts due to collisions.

About This Article

About the book authors:

John Mueller has produced 114 books and more than 600 articles on topics ranging from functional programming techniques to working with Amazon Web Services (AWS). Luca Massaron, a Google Developer Expert (GDE),??interprets big data and transforms it into smart data through simple and effective data mining and machine learning techniques.

This article can be found in the category:

General Data Science ,

Counting Objects in a Data Stream

Algorithms For Dummies

About This Article

This article is from the book:

About the book authors:

This article can be found in the category:

Article Categories

Book Categories

Collections

Counting Objects in a Data Stream

Algorithms For Dummies

About This Article

This article is from the book:

About the book authors:

This article can be found in the category: