After you have a basic knowledge of how Bayes’ Theorem came into being, you can look at the theorem itself. The following sections provide a history of Bayes’ Theorem and then move on to the theorem itself, presented from a practical perspective.

## A little Bayes history

You might wonder why anyone would name an algorithm Naïve Bayes (yet you find this algorithm among the most effective machine learning algorithms in packages such as Scikit-learn). The *naïve* part comes from its formulation; it makes some extreme simplifications to standard probability calculations. The reference to Bayes in its name relates to the Reverend Bayes and his theorem on probability.

The Reverend Thomas Bayes (1701–1761) was an English statistician and philosopher who formulated his theorem during the first half of the eighteenth century. Bayes’ Theorem is based on a thought experiment and then a demonstration using the simplest of means.

Reverend Bayes wanted to determine the probability of a future event based on the number of times it occurred in the past. It’s hard to contemplate how to accomplish this task with any accuracy.

The demonstration relied on the use of two balls. An assistant would drop the first ball on a table, where it was equally likely to come to rest at any location, without telling Bayes where it landed. The assistant would then drop a second ball, tell Bayes the position of the second ball, and then describe the position of the first ball relative to the location of this second ball.

The assistant would then drop the second ball a number of additional times — each time telling Bayes the location of the second ball and the position of the first ball relative to the second. After each toss of the second ball, Bayes would attempt to guess the position of the first. Eventually, he could pin down the position of the first ball based on the accumulated evidence provided by the second ball.

The theorem was never published while Bayes was alive. His friend Richard Price found Bayes’ notes after his death in 1761 and published the material for Bayes, but no one seemed to read it at first. Bayes’ Theorem revolutionized the theory of probability by introducing the idea of conditional probability — that is, probability conditioned by evidence.

The critics saw problems with Bayes’ Theorem that you can summarize as follows:

- Guessing has no place in rigorous mathematics.
- If Bayes didn’t know what to guess, he would simply assign all possible outcomes an equal probability of occurring, an assumption the critics considered arbitrary.
- Using the prior calculations to make a new guess presented an insurmountable problem.

Interest in the theorem revived because of astronomy. Astronomical observations had accumulated over the centuries from sources that included the:

- Chinese in 1100 BC
- Greeks in 200 BC
- Romans in AD 100
- Arabs in AD 1000

You might wonder why the study of astronomy suddenly became essential, and the short answer is money. Navigation of the late eighteenth century relied heavily on accurate celestial observations, so anyone who could make the readings more accurate could reduce the time required to ship goods from one part of the world to another.

Pierre-Simon Laplace wanted to solve the problem, but he couldn’t just dive into the astronomy data without first having a means to dig through all that data to find out which was correct and which wasn’t. He encountered Richard Price, who told him about Bayes’ Theorem.

Laplace used the theorem to solve an easier problem first: the births of males and females. Some people had noticed that more boys than girls were born each year, but no proof existed for this observation. Using birth records, Laplace applied Bayes’ Theorem to prove that the observation was correct.

Other statisticians took notice and started using the theorem, often secretly, for a host of other calculations, such as the calculation of the masses of Jupiter and Saturn from a wide variety of observations by Alexis Bouvard.

## The basic Bayes theorem formula

When thinking about Bayes’ Theorem, it helps to start from the beginning — that is, probability itself. *Probability* tells you the likelihood of an event and is expressed in a numeric form.

The probability of an event is measured in the range from 0 to 1 (from 0 percent to 100 percent) and it’s empirically derived from counting the number of times a specific event happens with respect to all the events. You can calculate it from data!

When you observe events (for example, when a feature has a certain characteristic) and you want to estimate the probability associated with the event, you count the number of times the characteristic appears in the data and divide that figure by the total number of observations available. The result is a number ranging from 0 to 1, which expresses the probability.

When you estimate the probability of an event, you tend to believe that you can apply the probability in each situation. The term for this belief is *a priori* because it constitutes the first estimate of probability with regard to an event (the one that comes to mind first).
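Here’s a minimal Python sketch of the counting procedure just described; the sample of 20 hair-length observations is invented so that the estimate matches the 35 percent figure used later in this section:

```python
# Estimate an empirical probability: count how often the
# characteristic appears, then divide by the total observations.
hair = ['long'] * 7 + ['short'] * 13   # 20 invented observations

p_long = hair.count('long') / len(hair)
print(p_long)  # 0.35, the empirical probability of long hair
```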

For example, if you estimate the probability that an unknown person is female, you might say, after some counting, that it’s 50 percent, which is the prior, or the first, probability that you stick with.

The prior probability can change in the face of evidence, that is, information that can radically modify your expectations.

For example, the evidence of whether a person is male or female could be that the person’s hair is long or short. You can estimate having long hair as an event with 35 percent probability for the general population, but within the female population, it’s 60 percent. Because the percentage in the female population is higher than the general probability (the prior for having long hair), having long hair is useful information for making a prediction.

Imagine that you have to guess whether a person is male or female and the evidence is that the person has long hair. This sounds like a predictive problem, and in the end, this situation is similar to predicting a categorical variable from data: You have a target variable with different categories, and you have to guess the probability of each category based on the evidence, the data. Reverend Bayes provided a useful formula:

`P(B|E) = P(E|B)*P(B) / P(E)`

The formula looks like statistical jargon and is a bit counterintuitive, so it needs to be explained in depth. Reading the formula using the previous example as input makes the meaning behind the formula quite a bit clearer:

- **P(B|E):** The probability of being a female (the belief B) given long hair (the evidence E). This part of the formula defines what you want to predict. In short, it says to predict y given x, where y is an outcome (male or female) and x is the evidence (long or short hair).
- **P(E|B):** The probability of having long hair when a person is female. In this case, you already know that it’s 60 percent. In every data problem, you can obtain this figure easily by simple cross-tabulation of the features against the target outcome (see the sketch after this list).
- **P(B):** The probability of being a female, which is a 50 percent general chance (a prior).
- **P(E):** The probability of having long hair in general, which is 35 percent (another prior).
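To see how that cross-tabulation might look in practice, here’s a minimal sketch using pandas; the tiny dataset, the column names, and the values are invented for illustration, chosen so that the result matches the 60 percent figure:

```python
import pandas as pd

# A tiny invented dataset: each row describes one person.
df = pd.DataFrame({
    'sex':  ['F'] * 5 + ['M'] * 5,
    'hair': ['long', 'long', 'long', 'short', 'short',
             'long', 'short', 'short', 'short', 'short'],
})

# Cross-tabulate hair length against sex; normalizing each column
# turns raw counts into conditional probabilities P(hair | sex).
table = pd.crosstab(df['hair'], df['sex'], normalize='columns')
print(table.loc['long', 'F'])  # 0.6, the P(E|B) for this example
```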

When reading parts of the formula such as P(B|E), you should read them as follows: probability of B given E. The | symbol translates as given. A probability expressed in this way is a conditional probability, because it’s the probability of a belief, B, conditioned by the evidence presented by E. In this example, plugging the numbers into the formula translates into

`60% * 50% / 35% = 85.7%`
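If you want to check the arithmetic yourself, here’s a minimal Python sketch that plugs the same numbers into the formula:

```python
# Bayes' Theorem: P(B|E) = P(E|B) * P(B) / P(E)
p_e_given_b = 0.60  # P(long hair | female), from the cross-tabulation
p_b = 0.50          # P(female), the prior belief
p_e = 0.35          # P(long hair), the prior for the evidence

p_b_given_e = p_e_given_b * p_b / p_e
print(f"{p_b_given_e:.1%}")  # 85.7%
```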

Therefore, getting back to the previous example, even though being a female is a 50 percent probability overall, knowing evidence such as long hair raises it to 85.7 percent, which is a more favorable chance for the guess. You can be more confident in guessing that a person with long hair is female because the chance of being wrong is a bit less than 15 percent.