After you have a basic knowledge of how Bayes’ Theorem came into being, you can look at the theorem itself. The following sections provide a history of Bayes’ Theorem and then move on to the theorem itself, presented from a practical perspective.

## A little Bayes history

You might wonder why anyone would name an algorithm Naïve Bayes (yet you find this algorithm among the most effective machine learning algorithms in packages such as Scikit-learn). The *naïve* part comes from its formulation; it makes some extreme simplifications to standard probability calculations. The reference to Bayes in its name relates to the Reverend Bayes and his theorem on probability.

The Reverend Thomas Bayes (1701–1761) was an English statistician and philosopher who formulated his theorem during the first half of the eighteenth century. Bayes’ Theorem is based on a thought experiment and then a demonstration using the simplest of means.

Reverend Bayes wanted to determine the probability of a future event based on the number of times it occurred in the past. It’s hard to contemplate how to accomplish this task with any accuracy.

The demonstration relied on the use of two balls. An assistant would drop the first ball on a table, where it was equally likely to come to rest at any location, without telling Bayes where it landed. The assistant would then drop a second ball, tell Bayes the position of the second ball, and then describe the position of the first ball relative to the location of this second ball.

The assistant would then drop the second ball a number of additional times — each time telling Bayes the location of the second ball and the position of the first ball relative to the second. After each toss of the second ball, Bayes would attempt to guess the position of the first. Eventually, he could pin down the position of the first ball based on the accumulated evidence provided by the second ball.

The theorem was never published while Bayes was alive. His friend Richard Price found Bayes’ notes after his death in 1761 and published the material for Bayes, but no one seemed to read it at first. Bayes’ Theorem revolutionized the theory of probability by introducing the idea of conditional probability — that is, probability conditioned by evidence.

The critics saw problems with Bayes’ Theorem that you can summarize as follows:

- Guessing has no place in rigorous mathematics.
- If Bayes didn’t know what to guess, he would simply assign all possible outcomes an equal probability of occurring, an assumption the critics considered arbitrary.
- Using the prior calculations to make a new guess presented an insurmountable problem.

Interest in the theorem revived because of astronomy. Astronomical observations had accumulated over the centuries from sources that included the:

- Chinese in 1100 BC
- Greeks in 200 BC
- Romans in AD 100
- Arabs in AD 1000

You might wonder why the study of astronomy suddenly became essential, and the short answer is money. Navigation of the late eighteenth century relied heavily on accurate celestial observations, so anyone who could make the readings more accurate could reduce the time required to ship goods from one part of the world to another.

Pierre-Simon Laplace wanted to solve the problem, but he couldn’t just dive into the astronomy data without first having a means to dig through all that data to find out which was correct and which wasn’t. He encountered Richard Price, who told him about Bayes’ Theorem.

Laplace used the theorem to solve an easier problem first: the births of males and females. Some people had noticed that more boys than girls were born each year, but no proof existed for this observation. Using birth records, Laplace applied Bayes’ Theorem to prove that the observation was correct.

Other statisticians took notice and started using the theorem, often secretly, for a host of other calculations, such as the calculation of the masses of Jupiter and Saturn from a wide variety of observations by Alexis Bouvard.

## The basic Bayes theorem formula

When thinking about Bayes’ Theorem, it helps to start from the beginning — that is, probability itself. *Probability* tells you the likelihood of an event and is expressed in a numeric form.

The probability of an event is measured in the range from 0 to 1 (from 0 percent to 100 percent) and it’s empirically derived from counting the number of times a specific event happens with respect to all the events. You can calculate it from data!

When you observe events (for example, when a feature has a certain characteristic) and you want to estimate the probability associated with the event, you count the number of times the characteristic appears in the data and divide that figure by the total number of observations available. The result is a number ranging from 0 to 1, which expresses the probability.

When you estimate the probability of an event, you tend to believe that you can apply the probability in each situation. The term for this belief is *a priori* because it constitutes the first estimate of probability with regard to an event (the one that comes to mind first).
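Here’s a minimal Python sketch of the counting procedure just described; the sample of 20 hair-length observations is invented so that the estimate matches the 35 percent figure used later in this section:

```python
# Estimate an empirical probability: count how often the
# characteristic appears, then divide by the total observations.
hair = ['long'] * 7 + ['short'] * 13   # 20 invented observations

p_long = hair.count('long') / len(hair)
print(p_long)  # 0.35, the empirical probability of long hair
```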

For example, if you estimate the probability that an unknown person is female, you might say, after some counting, that it’s 50 percent, which is the prior, or the first, probability that you stick with.

The prior probability can change in the face of evidence, that is, information that can radically modify your expectations.

For example, the evidence of whether a person is male or female could be that the person’s hair is long or short. You can estimate having long hair as an event with 35 percent probability for the general population, but within the female population, it’s 60 percent. Because the percentage in the female population is higher than the general probability (the prior for having long hair), having long hair is useful information for making a prediction.

Imagine that you have to guess whether a person is male or female and the evidence is that the person has long hair. This sounds like a predictive problem, and in the end, this situation is similar to predicting a categorical variable from data: You have a target variable with different categories, and you have to guess the probability of each category based on the evidence, the data. Reverend Bayes provided a useful formula:

`P(B|E) = P(E|B)*P(B) / P(E)`

The formula looks like statistical jargon and is a bit counterintuitive, so it needs to be explained in depth. Reading the formula using the previous example as input makes the meaning behind the formula quite a bit clearer:

- **P(B|E):** The probability of being a female (the belief B) given long hair (the evidence E). This part of the formula defines what you want to predict. In short, it says to predict y given x, where y is an outcome (male or female) and x is the evidence (long or short hair).
- **P(E|B):** The probability of having long hair when a person is female. In this case, you already know that it’s 60 percent. In every data problem, you can obtain this figure easily by simple cross-tabulation of the features against the target outcome (see the sketch after this list).
- **P(B):** The probability of being a female, which is a 50 percent general chance (a prior).
- **P(E):** The probability of having long hair in general, which is 35 percent (another prior).
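To see how that cross-tabulation might look in practice, here’s a minimal sketch using pandas; the tiny dataset, the column names, and the values are invented for illustration, chosen so that the result matches the 60 percent figure:

```python
import pandas as pd

# A tiny invented dataset: each row describes one person.
df = pd.DataFrame({
    'sex':  ['F'] * 5 + ['M'] * 5,
    'hair': ['long', 'long', 'long', 'short', 'short',
             'long', 'short', 'short', 'short', 'short'],
})

# Cross-tabulate hair length against sex; normalizing each column
# turns raw counts into conditional probabilities P(hair | sex).
table = pd.crosstab(df['hair'], df['sex'], normalize='columns')
print(table.loc['long', 'F'])  # 0.6, the P(E|B) for this example
```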

When reading parts of the formula such as P(B|E), you should read them as follows: probability of B given E. The | symbol translates as given. A probability expressed in this way is a conditional probability, because it’s the probability of a belief, B, conditioned by the evidence presented by E. In this example, plugging the numbers into the formula translates into

`60% * 50% / 35% = 85.7%`
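If you want to check the arithmetic yourself, here’s a minimal Python sketch that plugs the same numbers into the formula:

```python
# Bayes' Theorem: P(B|E) = P(E|B) * P(B) / P(E)
p_e_given_b = 0.60  # P(long hair | female), from the cross-tabulation
p_b = 0.50          # P(female), the prior belief
p_e = 0.35          # P(long hair), the prior for the evidence

p_b_given_e = p_e_given_b * p_b / p_e
print(f"{p_b_given_e:.1%}")  # 85.7%
```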

Therefore, getting back to the previous example, even though being a female is a 50 percent probability overall, knowing evidence such as long hair raises it to 85.7 percent, which is a more favorable chance for the guess. You can be more confident in guessing that a person with long hair is female because the chance of being wrong is a bit less than 15 percent.