
Data Science Programming All-in-One For Dummies

Authors:
John Paul Mueller ,
Luca Massaron
Published: January 9, 2020

Overview

Your logical, linear guide to the fundamentals of data science programming

Data science is exploding—in a good way—with a forecast of 1.7 megabytes of new information created every second for each human being on the planet by 2020 and 11.5 million job openings by 2026. It clearly pays dividends to be in the know. This friendly guide charts a path through the fundamentals of data science and then delves into the actual work: linear regression, logistic regression, machine learning, neural networks, recommender engines, and cross-validation of models.

Data Science Programming All-In-One For Dummies is a compilation of the key data science, machine learning, and deep learning programming languages: Python and R. It helps you decide which programming languages are best for specific data science needs. It also gives you the guidelines to build your own projects to solve problems in real time.

  • Get grounded: the ideal start for new data professionals
  • What lies ahead: learn about specific areas that data is transforming  
  • Be meaningful: find out how to tell your data story
  • See clearly: pick up the art of visualization

Whether you’re a beginning student or already mid-career, get your copy now and add even more meaning to your life—and everyone else’s!


Data Science Programming All-in-One For Dummies Cheat Sheet

Data science affects many different technologies in a profound manner. Our society runs on data today, so you can’t do many things that aren’t affected by it in some way. Even the timing of stoplights depends on data collected by the highway department. Your food shopping experience depends on data collected from Point of Sale (POS) terminals, surveys, farming data, and sources you can’t even begin to imagine. No matter how you use data, this cheat sheet will help you use it more effectively.

Articles From The Book


General Data Science Articles

A Brief Guide to Understanding Bayes’ Theorem

Before you begin using Bayes’ Theorem to perform practical tasks, knowing a little about its history is helpful. This knowledge is useful because Bayes’ Theorem doesn’t seem able to do everything it purports to do when you first see it, which is why many statisticians rejected it outright. After you have a basic knowledge of how Bayes’ Theorem came into being, you can look at the theorem itself. The following sections provide a history of Bayes’ Theorem and then move into the theorem itself, presented from a practical perspective.

A little Bayes history

You might wonder why anyone would name an algorithm Naïve Bayes (yet you find this algorithm among the most effective machine learning algorithms in packages such as Scikit-learn). The naïve part comes from its formulation; it makes some extreme simplifications to standard probability calculations. The reference to Bayes in its name relates to the Reverend Bayes and his theorem on probability. The Reverend Thomas Bayes (1701–1761) was an English statistician and philosopher who formulated his theorem during the first half of the eighteenth century.

Bayes’ Theorem is based on a thought experiment and then a demonstration using the simplest of means. Reverend Bayes wanted to determine the probability of a future event based on the number of times it occurred in the past. It’s hard to contemplate how to accomplish this task with any accuracy. The demonstration relied on the use of two balls. An assistant would drop the first ball on a table, where its end position was equally possible in any location, but would not tell Bayes where it landed. The assistant would then drop a second ball, tell Bayes the position of the second ball, and then provide the position of the first ball relative to the location of this second ball. The assistant would then drop the second ball a number of additional times, each time telling Bayes the location of the second ball and the position of the first ball relative to the second. After each toss of the second ball, Bayes would attempt to guess the position of the first. Eventually, he was able to guess the position of the first ball based on the evidence provided by the second ball.

The theorem was never published while Bayes was alive. His friend Richard Price found Bayes’ notes after his death in 1761 and published the material for Bayes, but no one seemed to read it at first. Bayes’ Theorem has deeply revolutionized the theory of probability by introducing the idea of conditional probability, that is, probability conditioned by evidence. The critics saw problems with Bayes’ Theorem that you can summarize as follows:
  • Guessing has no place in rigorous mathematics.
  • If Bayes didn’t know what to guess, he would simply assign all possible outcomes an equal probability of occurring.
  • Using the prior calculations to make a new guess presented an insurmountable problem.
Often, it takes a problem to illuminate the need for a previously defined solution, which is what happened with Bayes’ Theorem. By the late eighteenth century, the need to study astronomy and make sense of the observations made by the
  • Chinese in 1100 BC
  • Greeks in 200 BC
  • Romans in AD 100
  • Arabs in AD 1000
became essential. The readings made by these other civilizations not only reflected social and other biases but also were unreliable because of the differing methods of observation and the technology used. You might wonder why the study of astronomy suddenly became essential, and the short answer is money. Navigation in the late eighteenth century relied heavily on accurate celestial observations, so anyone who could make the readings more accurate could reduce the time required to ship goods from one part of the world to another.

Pierre-Simon Laplace wanted to solve the problem, but he couldn’t just dive into the astronomy data without first having a means to dig through all that data to find out which was correct and which wasn’t. He encountered Richard Price, who told him about Bayes’ Theorem. Laplace used the theorem to solve an easier problem, that of the births of males and females. Some people had noticed that more boys than girls were born each year, but no proof existed for this observation. Laplace used Bayes’ Theorem to prove, based on birth records, that more boys are born each year than girls. Other statisticians took notice and started using the theorem, often secretly, for a host of other calculations, such as the calculation of the masses of Jupiter and Saturn from a wide variety of observations by Alexis Bouvard.

The basic Bayes theorem formula

When thinking about Bayes’ Theorem, it helps to start from the beginning — that is, probability itself. Probability tells you the likelihood of an event and is expressed in a numeric form.

The probability of an event is measured in the range from 0 to 1 (from 0 percent to 100 percent) and it’s empirically derived from counting the number of times a specific event happens with respect to all the events. You can calculate it from data!

When you observe events (for example, when a feature has a certain characteristic) and you want to estimate the probability associated with the event, you count the number of times the characteristic appears in the data and divide that figure by the total number of observations available. The result is a number ranging from 0 to 1, which expresses the probability.

When you estimate the probability of an event, you tend to believe that you can apply the probability in each situation. The term for this belief is a priori because it constitutes the first estimate of probability with regard to an event (the one that comes to mind first). For example, if you estimate the probability that an unknown person is a female, you might say, after some counting, that it’s 50 percent, which is the prior, or the first, probability that you will stick with.

The prior probability can change in the face of evidence, that is, something that can radically modify your expectations. For example, the evidence of whether a person is male or female could be that the person’s hair is long or short. You can estimate having long hair as an event with 35 percent probability for the general population, but within the female population, it’s 60 percent. If the percentage is higher in the female population, contrary to the general probability (the prior for having long hair), that’s useful information for making a prediction.

Imagine that you have to guess whether a person is male or female and the evidence is that the person has long hair. This sounds like a predictive problem, and in the end, this situation is similar to predicting a categorical variable from data: You have a target variable with different categories, and you have to guess the probability of each category based on the evidence, the data. Reverend Bayes provided a useful formula:

P(B|E) = P(E|B) * P(B) / P(E)

The formula looks like statistical jargon and is a bit counterintuitive, so it needs to be explained in depth. Reading the formula using the previous example as input makes the meaning behind the formula quite a bit clearer:
  • P(B|E): The probability of being a female (the belief B) given long hair (the evidence E). This part of the formula defines what you want to predict. In short, it says to predict y given x where y is an outcome (male or female) and x is the evidence (long or short hair).
  • P(E|B): The probability of having long hair, the evidence of when a person is female. In this case, you already know that it’s 60 percent. In every data problem, you can obtain this figure easily by simple cross-tabulation of the features against the target outcome.
  • P(B): The probability of being a female, which has a 50 percent general chance (a prior).
  • P(E): The probability of having long hair in general, which is 35 percent (another prior).

When reading parts of the formula such as P(B|E), you should read them as follows: probability of B given E. The | symbol translates as given. A probability expressed in this way is a conditional probability, because it’s the probability of a belief, B, conditioned by the evidence presented by E. In this example, plugging the numbers into the formula translates into

60% * 50% / 35% = 85.7%

Therefore, getting back to the previous example, even if being a female is a 50 percent probability, just knowing evidence like long hair takes it up to 85.7 percent, which is a more favorable chance for the guess. You can be more confident in guessing that the person with long hair is a female because you have a bit less than a 15 percent chance of being wrong.
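To make the arithmetic concrete, here is a minimal Python sketch that plugs the example’s numbers into Bayes’ Theorem. The variable names are illustrative labels for the quantities discussed above, not part of any library.

# A minimal sketch of the long-hair example using Bayes' Theorem.
# The probabilities come straight from the text; the variable names
# are illustrative only.
p_female = 0.50             # P(B): prior probability of being female
p_long_hair = 0.35          # P(E): prior probability of long hair in general
p_long_given_female = 0.60  # P(E|B): probability of long hair among females

# Bayes' Theorem: P(B|E) = P(E|B) * P(B) / P(E)
p_female_given_long = p_long_given_female * p_female / p_long_hair
print(round(p_female_given_long, 3))  # prints 0.857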

General Data Science Articles

Linear Regression vs. Logistic Regression

Both linear and logistic regression see a lot of use in data science but are commonly used for different kinds of problems. You need to know and understand both types of regression to perform a full range of data science tasks. Of the two, logistic regression is harder to understand in many respects because it necessarily uses a more complex equation model. The following information gives you a basic overview of how linear and logistic regression differ.

The equation model

Any discussion of the difference between linear and logistic regression must start with the underlying equation model. The equation for linear regression is straightforward.
y = a + bx
You may see this equation in other forms and you may see it called ordinary least squares regression, but the essential concept is always the same. Depending on the source you use, some of the equations used to express logistic regression can become downright terrifying unless you’re a math major. However, the start of this discussion can use one of the simplest views of logistic regression:
p = f(a + bx)
The output, p, is equal to the logistic function, f, applied to two model parameters, a and b, and one explanatory variable, x. When you look at this particular model, you see that it really isn’t all that different from the linear regression model, except that you now feed the result of the linear regression through the logistic function to obtain the required curve. The output (dependent variable) is a probability ranging from 0 (not going to happen) to 1 (definitely will happen), or a categorization that says something is either part of the category or not part of the category. (You can also perform multiclass categorization, but focus on the binary response for now.) The best way to view the difference between linear regression output and logistic regression output is to say the following (a short code sketch after this list makes the contrast concrete):
  • Linear regression is continuous. A continuous value can take any value within a specified interval (range) of values. For example, no matter how closely the height of two individuals matches, you can always find someone whose height fits between those two individuals. Examples of continuous values include:
    • Height
    • Weight
    • Waist size
  • Logistic regression is discrete. A discrete value has specific values that it can assume. For example, a hospital can admit only a specific number of patients in a given day. You can’t admit half a patient (at least, not alive). Examples of discrete values include:
    • Number of people at the fair
    • Number of jellybeans in the jar
    • Colors of automobiles produced by a vendor
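As a quick illustration of the continuous-versus-discrete difference, the following minimal sketch evaluates both equation models for a single x value. The parameter values a, b, and x here are made up purely for illustration.

from math import exp

# Made-up parameters and input, purely for illustration.
a, b = 0.0, 1.0
x = 2.0

linear_output = a + b*x                  # linear regression: any real number
p = 1 / (1 + exp(-(a + b*x)))            # logistic regression: a probability
predicted_class = 1 if p >= 0.5 else 0   # discrete outcome: one of two categories

print(linear_output, round(p, 3), predicted_class)  # 2.0 0.881 1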

The logistic function

Of course, now you need to know about the logistic function. You can find a variety of forms of this function as well, but here’s the easiest one to understand:
f(x) = e^x / (e^x + 1)
You already know about f, which is the logistic function, and x equals the expression you want to evaluate, which is a + bx in this case. That leaves e, the base of the natural logarithm, which is an irrational number approximately equal to 2.718. Another way you see this function expressed is
f(x) = 1 / (1 + e^-x)
Both forms are correct, but the first form is easier to use. Consider a simple problem in which a, the y-intercept, is 0, and b, the slope, is 1. The example uses x values from –6 to 6. Consequently, the first f(x) value (for x = –6) would look like this when calculated (all values are rounded):
 
(1) e^-6 / (1 + e^-6)
(2) 0.00248 / (1 + 0.00248)
(3) 0.002474
As you might expect, an x value of 0 would result in an f(x) value of 0.5, and an x value of 6 would result in an f(x) value of 0.9975. Obviously, a linear regression would show different results for precisely the same x values. If you calculate and plot all the results from both logistic and linear regression using the following code, you receive a plot like the one below.
import matplotlib.pyplot as plt
%matplotlib inline
from math import exp

# Use x values from -6 to 6 (13 points in all).
x_values = range(-6, 7)

# Linear output over 13 points, divided by 13 to scale it into
# the 0 to 1 range for comparison.
lin_values = [(0 + 1*x) / 13 for x in range(0, 13)]

# Logistic output: feed a + bx (with a = 0, b = 1) through the logistic function.
log_values = [exp(0 + 1*x) / (1 + exp(0 + 1*x))
              for x in x_values]

plt.plot(x_values, lin_values, 'b-^')
plt.plot(x_values, log_values, 'g-*')
plt.legend(['Linear', 'Logistic'])
plt.show()
This example relies on list comprehension to calculate the values because it makes the calculations clearer. The linear regression uses a different numeric range because you must normalize the values to appear in the 0 to 1 range for comparison. This is also why you divide the calculated values by 13. The exp(x) call used for the logistic regression raises e to the power of x, e^x, as needed for the logistic function.

The model discussed here is simplified, and some math majors out there are probably throwing a temper tantrum of the most profound proportions right now. The Python or R package you use will actually take care of the math in the background, so really, what you need to know is how the math works at a basic level so that you can understand how to use the packages. This section provides what you need to use the packages. However, if you insist on carrying out the calculations the old way, chalk to chalkboard, you’ll likely need a lot more information.

The problems that logistic regression solves

You can separate logistic regression into several categories. The first is simple logistic regression, in which you have one dependent variable and one independent variable, much as you see in simple linear regression. However, because of how you calculate the logistic regression, you can expect only two kinds of output (a short code sketch after this list shows both):
  • Classification: Decides between two available outcomes, such as male or female, yes or no, or high or low. The outcome is dependent on which side of the line a particular data point falls.
  • Probability: Determines the probability that something is true or false. The values true and false can have specific meanings. For example, you might want to know the probability that a particular apple will be yellow or red based on the presence of yellow and red apples in a bin.
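Here is a minimal sketch of those two kinds of output using scikit-learn’s LogisticRegression; the feature values and labels below are invented purely for illustration and aren’t data from the book.

# A minimal sketch of classification versus probability output with
# scikit-learn. The data below is invented purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# One feature (an exam score) and a binary outcome (0 = fail, 1 = pass).
X = np.array([[35], [45], [55], [60], [68], [75], [82], [90]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

new_scores = np.array([[50], [70]])
print(model.predict(new_scores))        # classification: a 0 or 1 for each score
print(model.predict_proba(new_scores))  # probability of each class for each score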

Fit the curve

As part of understanding the difference between linear and logistic regression, consider this grade prediction problem, which lends itself well to linear regression. In the following code, you see the effect of trying to use logistic regression with that data:
# Original data: the fraction of exam questions answered correctly.
x1 = range(0, 9)
y1 = (0.25, 0.33, 0.41, 0.53, 0.59,
      0.70, 0.78, 0.86, 0.98)
plt.scatter(x1, y1, c='r')

# Linear fit using hand-picked intercept and slope values.
lin_values = [0.242 + 0.0933*x for x in x1]

# Logistic curve: evaluated at x from -4 to 4, then plotted against x1.
log_values = [exp(0.242 + .9033*x) /
              (1 + exp(0.242 + .9033*x))
              for x in range(-4, 5)]

plt.plot(x1, lin_values, 'b-^')
plt.plot(x1, log_values, 'g-*')
plt.legend(['Linear', 'Logistic', 'Org Data'])
plt.show()
The example has undergone a few changes to make it easier to see precisely what is happening. It relies on the same data, which was converted from the number of questions answered correctly on the exam to a percentage. If you have 100 questions and you answer 25 of them correctly, you have answered 25 percent (0.25) of them correctly. In other words, the values are normalized to fall between 0 and 1 (0 and 100 percent). As you can see from the image above, the linear regression follows the data points closely; the logistic regression doesn’t. However, logistic regression often is the correct choice when the data points naturally follow the logistic curve, which happens far more often than you might think. You must use the technique that fits your data best, which means using linear regression in this case.

A pass/fail example

An essential point to remember is that logistic regression works best for probability and classification. Consider that points on an exam ultimately predict passing or failing the course. If you get a certain percentage of the answers correct, you pass, but you fail otherwise. The following code considers the same data used for the example above, but converts it to a pass/fail list. When a student gets at least 70 percent of the questions correct, success is assured.
# Convert the continuous scores to discrete pass/fail outcomes:
# 1 when the score is at least 70 percent, 0 otherwise.
y2 = [0 if x < 0.70 else 1 for x in y1]
plt.scatter(x1, y2, c='r')

# The same linear and logistic curves as in the previous example.
lin_values = [0.242 + 0.0933*x for x in x1]
log_values = [exp(0.242 + .9033*x) /
              (1 + exp(0.242 + .9033*x))
              for x in range(-4, 5)]

plt.plot(x1, lin_values, 'b-^')
plt.plot(x1, log_values, 'g-*')
plt.legend(['Linear', 'Logistic', 'Org Data'])
plt.show()
This is an example of how you can use list comprehensions in Python to obtain a required dataset or data transformation. The list comprehension for y2 starts with the continuous data in y1 and turns it into discrete data. Note that the example uses precisely the same equations as before. All that has changed is the manner in which you view the data, as you can see below. Because of the change in the data, linear regression is no longer the option to choose. Instead, you use logistic regression to fit the data. Take into account that this example really hasn’t done any sort of analysis to optimize the results. The logistic regression fits the data even better if you do so, as the sketch below suggests.
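As one possible way to perform that optimization, the following minimal sketch fits scikit-learn’s LogisticRegression to the pass/fail labels instead of reusing the hand-picked coefficients above; it’s an illustrative approach, not the book’s worked solution.

# A minimal sketch of fitting the logistic curve to the pass/fail data
# with scikit-learn rather than hand-picked coefficients.
import numpy as np
from sklearn.linear_model import LogisticRegression

x1 = np.arange(0, 9).reshape(-1, 1)              # exam index as a single feature
y1 = [0.25, 0.33, 0.41, 0.53, 0.59,
      0.70, 0.78, 0.86, 0.98]                    # fraction answered correctly
y2 = [0 if score < 0.70 else 1 for score in y1]  # discrete pass/fail labels

model = LogisticRegression()
model.fit(x1, y2)

print(model.intercept_, model.coef_)  # the fitted a and b for the logistic curve
print(model.predict_proba(x1)[:, 1])  # fitted probability of passing for each x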

General Data Science Articles

How Data is Collected and Why It Can Be Problematic

Because data is so valuable and users are sometimes averse to giving it up, vendors constantly find new ways to collect data. One such method comes down to spying. Microsoft, for example, was recently accused (yet again) of spying on Windows 10 users even when the user doesn’t want that data collected. Lest you think that Microsoft is solely interested in your computing concerns, think again. The data Microsoft admits to collecting (and there is likely more) is pretty amazing. Microsoft’s data gathering doesn’t stop with your Windows 10 actions; it also collects data with Cortana, the personal assistant. Mind you, Alexa is accused of doing the same thing, and Google, likewise, does the same. So one of the trends vendors are using is spying, and it doesn’t stop with Microsoft, nor does it stop with the obvious spying sources. It might actually be possible to write an entire book on the ways in which people are spying on you, but that would make for a very paranoid book, and there are other new data collection trends to consider.

You may also have noticed that you get more email from everyone about the services or products you use. Everyone wants you to provide free information about your experiences in one of these forms:

  • Close-ended surveys: A close-ended survey is one in which the questions have specific answers that you check mark. The advantage is greater consistency of feedback. The disadvantage is that you can’t learn anything beyond the predefined answers.
  • Open-ended surveys: An open-ended survey is one in which the questions rely on text boxes in which the user enters data manually. In some cases, this form of survey enables you to find new information, but at the cost of consistency, reliability, and cleanliness of the data.
  • One-on-one interviews: Someone calls you or approaches you at a place like the mall and talks to you. When the interviewer is well trained, you obtain consistent data and can also discover new information. However, the quality of this information comes at the cost of paying someone to obtain it.
  • Focus group: Three or more people meet with an interviewer to discuss a topic (including products). Because the interviewer acts as a moderator, the consistency, reliability, and cleanliness of the data remain high and the costs are lower. However, now the data suffers contamination from the interaction between members of the focus group.
  • Direct observation: No conversation occurs in this case; someone monitors the interactions of another party with a product or service and records the responses using a script. However, because you now rely on a third party to interpret someone else’s actions, you have a problem with contamination in the form of bias. In addition, if the subject of the observation is aware of being monitored, the interactions likely won’t reflect reality.

These are just a few of the methods that are seeing greater use in data collection today. They’re just the tip of the iceberg. The key takeaway here is that no perfect means exists for collecting some types of data and all data collection methods require some sort of participative event.

Don’t want to find yourself in trouble? Here are ten mistakes to avoid when investing in data science.