Elements Added to Bloom Filters

Data Science Essentials For Dummies

Generally, you create Bloom filters for algorithms of a fixed size (recently developed versions allow you to resize the filter). You operate them by adding new elements to the filter and looking them up when already present. It's not possible to remove an element from the filter after adding it (the filter has an indelible memory).

When adding an element to a bit vector, the bit vector has some bits set to 1, as shown. In this case, the Bloom filter adds X to the bit vector.

Adding a single element to a bit vector.

You can add as many elements as is necessary to the bit vector. For example, the next figure shows what happens when adding another element, Y, to the bit vector. Note that bit 7 is the same for both X and Y. Consequently, bit 7 represents a collision between X and Y. These collisions are the source of the potential false positives; because of them, the algorithm could say that an element is already added to the bit vector when it isn't. Using a larger bit vector makes collisions less likely and improves the performance of the Bloom filter, but does so at the cost of both space and time.

Adding a second element can cause collisions.

About This Article

About the book author:

John Paul Mueller is a freelance author and technical editor. He has writing in his blood, having produced 100 books and more than 600 articles to date. The topics range from networking to home security and from database management to heads-down programming. John has provided technical services to both Data Based Advisor and Coast Compute magazines.

Luca Massaron is a data scientist specialized in organizing and interpreting big data and transforming it into smart data by means of the simplest and most effective data mining and machine learning techniques. Because of his job as a quantitative marketing consultant and marketing researcher, he has been involved in quantitative data since 2000 with different clients and in various industries, and is one of the top 10 Kaggle data scientists.