
### A Brief Guide to Understanding Bayes’ Theorem

Before you begin using Bayes’ Theorem to perform practical tasks, it helps to know a little about its history. This knowledge is useful because, when you first see it, Bayes’ Theorem doesn’t seem capable of doing everything it purports to do, which is why many statisticians rejected it outright.
After you have a basic knowledge of how Bayes’ Theorem came into being, you can look at the theorem itself. The following sections provide a history of Bayes’ Theorem and then move into the theorem itself, presented from a practical perspective.

## A little Bayes history

You might wonder why anyone would name an algorithm Naïve Bayes (yet you find this algorithm among the most effective machine learning algorithms in packages such as Scikit-learn). The *naïve* part comes from its formulation; it makes some extreme simplifications to standard probability calculations. The reference to Bayes in its name relates to the Reverend Bayes and his theorem on probability.
The Reverend Thomas Bayes (1701–1761) was an English statistician and philosopher who formulated his theorem during the first half of the eighteenth century. Bayes’ Theorem is based on a thought experiment and then a demonstration using the simplest of means.
Reverend Bayes wanted to determine the probability of a future event based on the number of times it occurred in the past. It’s hard to contemplate how to accomplish this task with any accuracy.
The demonstration relied on the use of two balls. An assistant would drop the first ball on a table, where it was equally likely to come to rest at any location, without telling Bayes where it lay. The assistant would then drop a second ball, tell Bayes its position, and describe the position of the first ball relative to the second.
The assistant would drop the second ball a number of additional times, each time telling Bayes the location of the second ball and the position of the first ball relative to it. After each toss of the second ball, Bayes would refine his guess about the position of the first ball, relying on the accumulating evidence that the second ball provided.
The theorem was never published while Bayes was alive. His friend Richard Price found Bayes’ notes after his death in 1761 and published the material in 1763, but at first no one seemed to read it.

Bayes’ Theorem revolutionized the theory of probability by introducing the idea of conditional probability, that is, probability conditioned by evidence.
The critics saw problems with Bayes’ Theorem that you can summarize as follows:

- Guessing has no place in rigorous mathematics.
- When Bayes didn’t know what to guess, he simply assigned all possible outcomes an equal probability of occurring, which struck critics as arbitrary.
- Using the prior calculations to make each new guess appeared to present an insurmountable problem.

Often, it takes a problem to illuminate the need for a previously defined solution, which is what happened with Bayes’ Theorem. By the late eighteenth century, the need to study astronomy and make sense of the observations made by the

- Chinese in 1100 BC
- Greeks in 200 BC
- Romans in AD 100
- Arabs in AD 1000

became essential. The readings made by these other civilizations not only reflected social and other biases but also were unreliable because of the differing methods of observation and the technology used.
You might wonder why the study of astronomy suddenly became essential, and the short answer is money.

Navigation of the late eighteenth century relied heavily on accurate celestial observations, so anyone who could make the readings more accurate could reduce the time required to ship goods from one part of the world to another.
Pierre-Simon Laplace wanted to solve the problem, but he couldn’t just dive into the astronomy data without first having a means to dig through all that data to find out which was correct and which wasn’t. He encountered Richard Price, who told him about Bayes’ Theorem.
Laplace used the theorem to solve an easier problem, that of the births of males and females. Some people had noticed that more boys than girls were born each year, but no proof existed for this observation. Laplace used Bayes’ Theorem to prove that more boys are born each year than girls based on birth records.
Other statisticians took notice and started using the theorem, often secretly, for a host of other calculations, such as the calculation of the masses of Jupiter and Saturn from a wide variety of observations by Alexis Bouvard.

## The basic Bayes theorem formula

When thinking about Bayes’ Theorem, it helps to start from the beginning — that is, probability itself.

*Probability* tells you the likelihood of an event and is expressed in a numeric form.

The probability of an event is measured in the range from 0 to 1 (from 0 percent to 100 percent) and it’s empirically derived from counting the number of times a specific event happens with respect to all the events. You can calculate it from data!

When you observe events (for example, when a feature has a certain characteristic) and you want to estimate the probability associated with the event, you count the number of times the characteristic appears in the data and divide that figure by the total number of observations available. The result is a number ranging from 0 to 1, which expresses the probability.
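Here’s a minimal sketch of that counting in Python (the observations are invented for illustration):

# 1 means the characteristic appears in an observation; 0 means it doesn't.
observations = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
# Empirical probability: occurrences divided by the total number of observations.
probability = sum(observations) / len(observations)
print(probability)  # prints 0.4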
When you estimate the probability of an event, you tend to believe that you can apply the probability in each situation. The term for this belief is *a priori*, because it constitutes the first estimate of probability with regard to an event (the one that comes to mind first).
For example, if you estimate the probability of an unknown person being female, you might say, after some counting, that it’s 50 percent, which is the prior, or first, probability that you will stick with.
The prior probability can change in the face of evidence, that is, something that can radically modify your expectations.
For example, the evidence about whether a person is male or female could be whether the person’s hair is long or short. You can estimate the probability of having long hair at 35 percent for the general population, but within the female population, it’s 60 percent. Because that percentage is higher in the female population than in the general population (the prior for having long hair), it’s useful information for making a prediction.
Imagine that you have to guess whether a person is male or female and the evidence is that the person has long hair. This sounds like a predictive problem, and in the end, this situation is similar to predicting a categorical variable from data: you have a target variable with different categories, and you have to guess the probability of each category based on the evidence, the data. Reverend Bayes provided a useful formula:

`P(B|E) = P(E|B)*P(B) / P(E)`

The formula looks like statistical jargon and is a bit counterintuitive, so it needs to be explained in depth. Reading the formula using the previous example as input makes the meaning behind the formula quite a bit clearer:

**P(B|E):** The probability of being a female (the belief B) given long hair (the evidence E). This part of the formula defines what you want to predict. In short, it says to predict y given x where y is an outcome (male or female) and x is the evidence (long or short hair).
**P(E|B):** The probability of having long hair (the evidence) when a person is female. In this case, you already know that it’s 60 percent. In every data problem, you can obtain this figure easily by a simple cross-tabulation of the features against the target outcome.
**P(B):** The probability of being a female, which has a 50 percent general chance (a prior).
**P(E):** The probability of having long hair in general, which is 35 percent (another prior).

When reading parts of the formula such as P(B|E), you should read them as follows: probability of B given E. The | symbol translates as given. A probability expressed in this way is a conditional probability, because it’s the probability of a belief, B, conditioned by the evidence presented by E. In this example, plugging the numbers into the formula translates into

`60% * 50% / 35% = 85.7%`

Therefore, getting back to the previous example, even if being a female is a 50 percent probability, just knowing evidence like long hair takes it up to 85.7 percent, which is a more favorable chance for the guess. You can be more confident in guessing that the person with long hair is a female because you have a bit less than a 15 percent chance of being wrong.
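If you want to see this arithmetic as code, here’s a minimal sketch of the same calculation in Python:

p_b = 0.50          # P(B): prior probability of being female
p_e = 0.35          # P(E): prior probability of long hair in general
p_e_given_b = 0.60  # P(E|B): probability of long hair among females
# Bayes' Theorem: P(B|E) = P(E|B) * P(B) / P(E)
p_b_given_e = p_e_given_b * p_b / p_e
print(round(p_b_given_e, 3))  # prints 0.857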


### Linear Regression vs. Logistic Regression

Both linear and logistic regression see a lot of use in data science but are commonly used for different kinds of problems. You need to know and understand both types of regression to perform a full range of data science tasks.
Of the two, logistic regression is harder to understand in many respects because it necessarily uses a more complex equation model. The following information gives you a basic overview of how linear and logistic regression differ.

## The equation model

Any discussion of the difference between linear and logistic regression must start with the underlying equation model. The equation for linear regression is straightforward.

y = a + bx

You may see this equation in other forms and you may see it called ordinary least squares regression, but the essential concept is always the same. Depending on the source you use, some of the equations used to express logistic regression can become downright terrifying unless you’re a math major. However, the start of this discussion can use one of the simplest views of logistic regression:

p = f(a + bx)

The output, `p`, is equal to the logistic function, `f`, applied to two model parameters, `a` and `b`, and one explanatory variable, `x`. When you look at this particular model, you see that it really isn’t all that different from the linear regression model, except that you now feed the result of the linear regression through the logistic function to obtain the required curve.
The output (dependent variable) is a probability ranging from 0 (not going to happen) to 1 (definitely will happen), or a categorization that says something is either part of the category or not part of the category. (You can also perform multiclass categorization, but focus on the binary response for now.) The best way to view the difference between linear regression output and logistic regression output is to say the following:

**Linear regression is continuous.** A continuous value can take any value within a specified interval (range) of values. For example, no matter how closely the height of two individuals matches, you can always find someone whose height fits between those two individuals. Examples of continuous values include:
- Height of a person
- Weight of a package
- Temperature of a room
**Logistic regression is discrete.** A discrete value has specific values that it can assume. For example, a hospital can admit only a specific number of patients in a given day. You can’t admit half a patient (at least, not alive). Examples of discrete values include:
- Number of people at the fair
- Number of jellybeans in the jar
- Colors of automobiles produced by a vendor

## The logistic function

Of course, now you need to know about the logistic function. You can find a variety of forms of this function as well, but here’s the easiest one to understand:

f(x) = e^{x} / (e^{x} + 1)

You already know about `f`, which is the logistic function, and `x`, which is the value you feed into it, `a + bx` in this case. That leaves `e`, the base of the natural logarithm, an irrational number whose value is approximately 2.71828. Another way you see this function expressed is

f(x) = 1 / (1 + e^{-x})

Both forms are correct, but the first form is easier to use. Consider a simple problem in which `a`, the y-intercept, is 0, and `b`, the slope, is 1. The example uses `x` values from –6 to 6. Consequently, the first `f(x)` value would look like this when calculated (all values are rounded):

(1) e^{-6} / (1 + e^{-6})
(2) 0.00248 / (1 + 0.00248)
(3) 0.002474
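You can check this arithmetic, along with the values for other inputs, in a few lines of Python:

from math import exp
def logistic(x):
    # The second form of the function: 1 / (1 + e^(-x))
    return 1 / (1 + exp(-x))
# Shows roughly 0.0025 for -6, 0.5 for 0, and 0.9975 for 6.
for x in (-6, 0, 6):
    print(x, round(logistic(x), 4))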

As you might expect, an `x` value of 0 would result in an `f(x)` value of 0.5, and an `x` value of 6 would result in an `f(x)` value of 0.9975. Obviously, a linear regression would show different results for precisely the same `x` values. If you calculate and plot all the results from both logistic and linear regression using the following code, you receive a plot like the one below.

import matplotlib.pyplot as plt
%matplotlib inline
from math import exp
x_values = range(-6, 7)
# Normalize the 13 linear outputs so that they fall in the 0-1 range.
lin_values = [(0 + 1*x) / 13 for x in range(0, 13)]
# Feed the same linear model, a + bx, through the logistic function.
log_values = [exp(0 + 1*x) / (1 + exp(0 + 1*x))
              for x in x_values]
plt.plot(x_values, lin_values, 'b-^')
plt.plot(x_values, log_values, 'g-*')
plt.legend(['Linear', 'Logistic'])
plt.show()

This example relies on list comprehension to calculate the values because it makes the calculations clearer. The linear regression uses a different numeric range because you must normalize the values to appear in the 0 to 1 range for comparison, which is also why you divide the calculated values by 13. The `exp(x)` call used for the logistic regression raises `e` to the power of `x`, e^{x}, as needed for the logistic function.

The model discussed here is simplified, and some math majors out there are probably throwing a temper tantrum of the most profound proportions right now. The Python or R package you use will actually take care of the math in the background, so really, what you need to know is how the math works at a basic level so that you can understand how to use the packages. This section provides what you need to use the packages. However, if you insist on carrying out the calculations the old way, chalk to chalkboard, you’ll likely need a lot more information.

## The problems that logistic regression solves

You can separate logistic regression into several categories. The first is simple logistic regression, in which you have one dependent variable and one independent variable, much as you see in simple linear regression. However, because of how you calculate the logistic regression, you can expect only two kinds of output:

**Classification:** Decides between two available outcomes, such as male or female, yes or no, or high or low. The outcome is dependent on which side of the line a particular data point falls.
**Probability:** Determines the probability that something is true or false. The values true and false can have specific meanings. For example, you might want to know the probability that a particular apple will be yellow or red based on the presence of yellow and red apples in a bin.
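If you let a package do the math, both kinds of output are one method call away. Here’s a minimal sketch using Scikit-learn’s `LogisticRegression` class (the hours-studied data is invented for illustration):

from sklearn.linear_model import LogisticRegression
# Hypothetical data: hours studied and whether the student passed (1) or failed (0).
hours = [[1], [2], [3], [4], [5], [6], [7], [8]]
passed = [0, 0, 0, 0, 1, 1, 1, 1]
model = LogisticRegression()
model.fit(hours, passed)
# Classification: which side of the line does a new data point fall on?
print(model.predict([[4.5]]))
# Probability: how likely is each outcome for that same point?
print(model.predict_proba([[4.5]]))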

## Fit the curve

As part of understanding the difference between linear and logistic regression, consider this grade prediction problem, which lends itself well to linear regression. In the following code, you see the effect of trying to use logistic regression with that data:

# Exam scores expressed as the fraction of questions answered correctly.
x1 = range(0, 9)
y1 = (0.25, 0.33, 0.41, 0.53, 0.59,
      0.70, 0.78, 0.86, 0.98)
plt.scatter(x1, y1, c='r')
# A linear fit and a logistic curve plotted over the same x positions.
lin_values = [0.242 + 0.0933*x for x in x1]
log_values = [exp(0.242 + 0.9033*x) /
              (1 + exp(0.242 + 0.9033*x))
              for x in range(-4, 5)]
plt.plot(x1, lin_values, 'b-^')
plt.plot(x1, log_values, 'g-*')
plt.legend(['Linear', 'Logistic', 'Org Data'])
plt.show()

The example has undergone a few changes to make it easier to see precisely what is happening. It relies on the same data, converted from the number of questions answered correctly on the exam to a fraction. If you have 100 questions and you answer 25 of them correctly, you have answered 25 percent (0.25) of them correctly. The values are normalized to the range 0 to 1 (0 to 100 percent).
As you can see from the image above, the linear regression follows the data points closely. The logistic regression doesn’t. However, logistic regression often is the correct choice when the data points naturally follow the logistic curve, which happens far more often than you might think. You must use the technique that fits your data best, which means using linear regression in this case.

## A pass/fail example

An essential point to remember is that logistic regression works best for probability and classification. Consider that points on an exam ultimately predict passing or failing the course. If you get a certain percentage of the answers correct, you pass, but you fail otherwise. The following code considers the same data used for the example above, but converts it to a pass/fail list. When a student gets at least 70 percent of the questions correct, success is assured.

# Convert the continuous scores into a discrete pass (1) / fail (0) outcome.
y2 = [0 if x < 0.70 else 1 for x in y1]
plt.scatter(x1, y2, c='r')
lin_values = [0.242 + 0.0933*x for x in x1]
log_values = [exp(0.242 + 0.9033*x) /
              (1 + exp(0.242 + 0.9033*x))
              for x in range(-4, 5)]
plt.plot(x1, lin_values, 'b-^')
plt.plot(x1, log_values, 'g-*')
plt.legend(['Linear', 'Logistic', 'Org Data'])
plt.show()

This is an example of how you can use list comprehensions in Python to obtain a required dataset or data transformation. The list comprehension for `y2` starts with the continuous data in `y1` and turns it into discrete data. Note that the example uses precisely the same equations as before. All that has changed is the manner in which you view the data, as you can see below.
Because of the change in the data, linear regression is no longer the right choice. Instead, you use logistic regression to fit the data. Keep in mind that this example hasn’t done any analysis to optimize the results; the logistic regression fits the data even better if you do.
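As a rough sketch of what that optimization might look like, you could hand the pass/fail data to Scikit-learn’s `LogisticRegression` and let it estimate the intercept and slope rather than reusing the hand-picked coefficients:

from sklearn.linear_model import LogisticRegression
# The same data as above, reshaped into the (samples, features) layout.
x1 = [[x] for x in range(0, 9)]
y2 = [0, 0, 0, 0, 0, 1, 1, 1, 1]  # the pass/fail outcomes from the example
model = LogisticRegression()
model.fit(x1, y2)
# The fitted intercept (a) and slope (b) of the underlying linear model.
print(model.intercept_, model.coef_)
# Estimated probabilities of failing and passing when x = 4.
print(model.predict_proba([[4]]))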