
Data Science Programming All-in-One For Dummies

By: John Paul Mueller and Luca Massaron Published: 01-09-2020

Your logical, linear guide to the fundamentals of data science programming

Data science is exploding—in a good way—with a forecast of 1.7 megabytes of new information created every second for each human being on the planet by 2020 and 11.5 million job openings by 2026. It clearly pays dividends to be in the know. This friendly guide charts a path through the fundamentals of data science and then delves into the actual work: linear regression, logistic regression, machine learning, neural networks, recommender engines, and cross-validation of models.

Data Science Programming All-in-One For Dummies covers the key programming languages for data science, machine learning, and deep learning: Python and R. It helps you decide which language is best for specific data science needs, and it gives you guidelines for building your own projects to solve problems in real time.

  • Get grounded: the ideal start for new data professionals
  • What lies ahead: learn about specific areas that data is transforming  
  • Be meaningful: find out how to tell your data story
  • See clearly: pick up the art of visualization

Whether you’re a beginning student or already mid-career, get your copy now and add even more meaning to your life—and everyone else’s!

Articles From Data Science Programming All-in-One For Dummies

Data Science Programming All-in-One For Dummies Cheat Sheet

Cheat Sheet / Updated 04-25-2022

Data science affects many different technologies in a profound manner. Our society runs on data today, so you can’t do many things that aren’t affected by it in some way. Even the timing of stoplights depends on data collected by the highway department. Your food shopping experience depends on data collected from Point of Sale (POS) terminals, surveys, farming data, and sources you can’t even begin to imagine. No matter how you use data, this cheat sheet will help you use it more effectively.

A Brief Guide to Understanding Bayes’ Theorem

Article / Updated 03-06-2020

Before you begin using Bayes’ Theorem to perform practical tasks, knowing a little about its history is helpful. This knowledge is useful because Bayes’ Theorem doesn’t seem to be able to do everything it purports to do when you first see it, which is why many statisticians rejected it outright. After you have a basic knowledge of how Bayes’ Theorem came into being, you need to look at the theorem itself. The following sections provide you with a history of Bayes’ Theorem and then move into the theorem itself. Here, Bayes’ Theorem is presented from a practical perspective.

A little Bayes history

You might wonder why anyone would name an algorithm Naïve Bayes (yet you find this algorithm among the most effective machine learning algorithms in packages such as Scikit-learn). The naïve part comes from its formulation; it makes some extreme simplifications to standard probability calculations. The reference to Bayes in its name relates to the Reverend Bayes and his theorem on probability. The Reverend Thomas Bayes (1701–1761) was an English statistician and philosopher who formulated his theorem during the first half of the eighteenth century.

Bayes’ Theorem is based on a thought experiment and then a demonstration using the simplest of means. Reverend Bayes wanted to determine the probability of a future event based on the number of times it occurred in the past. It’s hard to contemplate how to accomplish this task with any accuracy. The demonstration relied on the use of two balls. An assistant would drop the first ball on a table, where its end position was equally possible in any location, but not tell Bayes its location. The assistant would then drop a second ball, tell Bayes the position of the second ball, and then provide the position of the first ball relative to the location of this second ball. The assistant would then drop the second ball a number of additional times — each time telling Bayes the location of the second ball and the position of the first ball relative to the second. After each toss of the second ball, Bayes would attempt to guess the position of the first. Eventually, he was to guess the position of the first ball based on the evidence provided by the second ball.

The theorem was never published while Bayes was alive. His friend Richard Price found Bayes’ notes after his death in 1761 and published the material for Bayes, but no one seemed to read it at first. Bayes’ Theorem has deeply revolutionized the theory of probability by introducing the idea of conditional probability — that is, probability conditioned by evidence. The critics saw problems with Bayes’ Theorem that you can summarize as follows:

  • Guessing has no place in rigorous mathematics. If Bayes didn’t know what to guess, he would simply assign all possible outcomes an equal probability of occurring.
  • Using the prior calculations to make a new guess presented an insurmountable problem.

Often, it takes a problem to illuminate the need for a previously defined solution, which is what happened with Bayes’ Theorem. By the late eighteenth century, the need to study astronomy and make sense of the observations made by the following civilizations became essential:

  • Chinese in 1100 BC
  • Greeks in 200 BC
  • Romans in AD 100
  • Arabs in AD 1000

The readings made by these other civilizations not only reflected social and other biases but also were unreliable because of the differing methods of observation and the technology in use.
You might wonder why the study of astronomy suddenly became essential, and the short answer is money. Navigation of the late eighteenth century relied heavily on accurate celestial observations, so anyone who could make the readings more accurate could reduce the time required to ship goods from one part of the world to another. Pierre-Simon Laplace wanted to solve the problem, but he couldn’t just dive into the astronomy data without first having a means to dig through all that data to find out which was correct and which wasn’t. He encountered Richard Price, who told him about Bayes’ Theorem. Laplace used the theorem to solve an easier problem, that of the births of males and females. Some people had noticed that more boys than girls were born each year, but no proof existed for this observation. Laplace used Bayes’ Theorem to prove that more boys are born each year than girls based on birth records. Other statisticians took notice and started using the theorem, often secretly, for a host of other calculations, such as the calculation of the masses of Jupiter and Saturn from a wide variety of observations by Alexis Bouvard.

The basic Bayes theorem formula

When thinking about Bayes’ Theorem, it helps to start from the beginning — that is, probability itself. Probability tells you the likelihood of an event and is expressed in a numeric form. The probability of an event is measured in the range from 0 to 1 (from 0 percent to 100 percent) and is empirically derived from counting the number of times a specific event happens with respect to all the events. You can calculate it from data! When you observe events (for example, when a feature has a certain characteristic) and you want to estimate the probability associated with the event, you count the number of times the characteristic appears in the data and divide that figure by the total number of observations available. The result is a number ranging from 0 to 1, which expresses the probability.

When you estimate the probability of an event, you tend to believe that you can apply the probability in each situation. The term for this belief is a priori because it constitutes the first estimate of probability with regard to an event (the one that comes to mind first). For example, if you estimate the probability of an unknown person being female, you might say, after some counting, that it’s 50 percent, which is the prior, or the first, probability that you will stick with.

The prior probability can change in the face of evidence, that is, something that can radically modify your expectations. For example, the evidence of whether a person is male or female could be that the person’s hair is long or short. You can estimate having long hair as an event with 35 percent probability for the general population, but within the female population, it’s 60 percent. If the percentage is higher in the female population, contrary to the general probability (the prior for having long hair), that’s useful information for making a prediction. Imagine that you have to guess whether a person is male or female and the evidence is that the person has long hair. This sounds like a predictive problem, and in the end, this situation is similar to predicting a categorical variable from data: You have a target variable with different categories, and you have to guess the probability of each category based on the evidence, the data.
Reverend Bayes provided a useful formula:

P(B|E) = P(E|B) * P(B) / P(E)

The formula looks like statistical jargon and is a bit counterintuitive, so it needs to be explained in depth. Reading the formula using the previous example as input makes the meaning behind the formula quite a bit clearer:

  • P(B|E): The probability of being a female (the belief B) given long hair (the evidence E). This part of the formula defines what you want to predict. In short, it says to predict y given x, where y is an outcome (male or female) and x is the evidence (long or short hair).
  • P(E|B): The probability of having long hair when a person is female. In this case, you already know that it’s 60 percent. In every data problem, you can obtain this figure easily by simple cross-tabulation of the features against the target outcome.
  • P(B): The probability of being a female, which has a 50 percent general chance (a prior).
  • P(E): The probability of having long hair in general, which is 35 percent (another prior).

When reading parts of the formula such as P(B|E), you should read them as follows: probability of B given E. The | symbol translates as given. A probability expressed in this way is a conditional probability, because it’s the probability of a belief, B, conditioned by the evidence presented by E. In this example, plugging the numbers into the formula translates into 60% * 50% / 35% = 85.7%.

Therefore, getting back to the previous example, even if being a female is a 50 percent probability, just knowing evidence like long hair takes it up to 85.7 percent, which is a more favorable chance for the guess. You can be more confident in guessing that the person with long hair is a female because you have a bit less than a 15 percent chance of being wrong.
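As a quick check, you can plug the numbers from this example into the formula with a few lines of Python (a minimal sketch using only basic arithmetic; the variable names are chosen here for illustration):

p_e_given_b = 0.60   # P(E|B): probability of long hair among females
p_b = 0.50           # P(B): prior probability of being female
p_e = 0.35           # P(E): probability of long hair in the general population

# Bayes' Theorem: P(B|E) = P(E|B) * P(B) / P(E)
p_b_given_e = p_e_given_b * p_b / p_e
print(round(p_b_given_e, 3))   # 0.857, the 85.7 percent quoted above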

Linear Regression vs. Logistic Regression

Article / Updated 02-18-2020

Both linear and logistic regression see a lot of use in data science but are commonly used for different kinds of problems. You need to know and understand both types of regression to perform a full range of data science tasks. Of the two, logistic regression is harder to understand in many respects because it necessarily uses a more complex equation model. The following information gives you a basic overview of how linear and logistic regression differ.

The equation model

Any discussion of the difference between linear and logistic regression must start with the underlying equation model. The equation for linear regression is straightforward:

y = a + bx

You may see this equation in other forms and you may see it called ordinary least squares regression, but the essential concept is always the same. Depending on the source you use, some of the equations used to express logistic regression can become downright terrifying unless you’re a math major. However, the start of this discussion can use one of the simplest views of logistic regression:

p = f(a + bx)

The output, p, is equal to the logistic function, f, applied to two model parameters, a and b, and one explanatory variable, x. When you look at this particular model, you see that it really isn’t all that different from the linear regression model, except that you now feed the result of the linear regression through the logistic function to obtain the required curve. The output (dependent variable) is a probability ranging from 0 (not going to happen) to 1 (definitely will happen), or a categorization that says something is either part of the category or not part of the category. (You can also perform multiclass categorization, but focus on the binary response for now.) The best way to view the difference between linear regression output and logistic regression output is to say the following:

  • Linear regression is continuous. A continuous value can take any value within a specified interval (range) of values. For example, no matter how closely the height of two individuals matches, you can always find someone whose height fits between those two individuals. Examples of continuous values include height, weight, and waist size.
  • Logistic regression is discrete. A discrete value has specific values that it can assume. For example, a hospital can admit only a specific number of patients in a given day. You can’t admit half a patient (at least, not alive). Examples of discrete values include the number of people at the fair, the number of jellybeans in the jar, and the colors of automobiles produced by a vendor.

The logistic function

Of course, now you need to know about the logistic function. You can find a variety of forms of this function as well, but here’s the easiest one to understand:

f(x) = e^x / (e^x + 1)

You already know about f, which is the logistic function, and x equals the algorithm you want to use, which is a + bx in this case. That leaves e, the base of the natural logarithm, which has an irrational value of approximately 2.718 for the sake of this discussion. Another way you see this function expressed is

f(x) = 1 / (1 + e^-x)

Both forms are correct, but the first form is easier to use. Consider a simple problem in which a, the y-intercept, is 0, and b, the slope, is 1. The example uses x values from –6 to 6.
Consequently, the first f(x) value would look like this when calculated (all values are rounded):

e^-6 / (1 + e^-6) = 0.00248 / (1 + 0.00248) = 0.002474

As you might expect, an x value of 0 would result in an f(x) value of 0.5, and an x value of 6 would result in an f(x) value of 0.9975. Obviously, a linear regression would show different results for precisely the same x values. If you calculate and plot all the results from both logistic and linear regression using the following code, you receive a plot that compares the two curves.

import matplotlib.pyplot as plt
%matplotlib inline
from math import exp

x_values = range(-6, 7)
lin_values = [(0 + 1*x) / 13 for x in range(0, 13)]
log_values = [exp(0 + 1*x) / (1 + exp(0 + 1*x)) for x in x_values]

plt.plot(x_values, lin_values, 'b-^')
plt.plot(x_values, log_values, 'g-*')
plt.legend(['Linear', 'Logistic'])
plt.show()

This example relies on list comprehension to calculate the values because it makes the calculations clearer. The linear regression uses a different numeric range because you must normalize the values to appear in the 0 to 1 range for comparison. This is also why you divide the calculated values by 13. The exp(x) call used for the logistic regression raises e to the power of x, e^x, as needed for the logistic function.

The model discussed here is simplified, and some math majors out there are probably throwing a temper tantrum of the most profound proportions right now. The Python or R package you use will actually take care of the math in the background, so really, what you need to know is how the math works at a basic level so that you can understand how to use the packages. This section provides what you need to use the packages. However, if you insist on carrying out the calculations the old way, chalk to chalkboard, you’ll likely need a lot more information.

The problems that logistic regression solves

You can separate logistic regression into several categories. The first is simple logistic regression, in which you have one dependent variable and one independent variable, much as you see in simple linear regression. However, because of how you calculate the logistic regression, you can expect only two kinds of output:

  • Classification: Decides between two available outcomes, such as male or female, yes or no, or high or low. The outcome is dependent on which side of the line a particular data point falls.
  • Probability: Determines the probability that something is true or false. The values true and false can have specific meanings. For example, you might want to know the probability that a particular apple will be yellow or red based on the presence of yellow and red apples in a bin.

Fit the curve

As part of understanding the difference between linear and logistic regression, consider this grade prediction problem, which lends itself well to linear regression. In the following code, you see the effect of trying to use logistic regression with that data:

x1 = range(0,9)
y1 = (0.25, 0.33, 0.41, 0.53, 0.59, 0.70, 0.78, 0.86, 0.98)
plt.scatter(x1, y1, c='r')
lin_values = [0.242 + 0.0933*x for x in x1]
log_values = [exp(0.242 + .9033*x) / (1 + exp(0.242 + .9033*x)) for x in range(-4, 5)]
plt.plot(x1, lin_values, 'b-^')
plt.plot(x1, log_values, 'g-*')
plt.legend(['Linear', 'Logistic', 'Org Data'])
plt.show()

The example has undergone a few changes to make it easier to see precisely what is happening. It relies on the same data that was converted from questions answered correctly on the exam to a percentage.
If you have 100 questions and you answer 25 of them correctly, you have answered 25 percent (0.25) of them correctly. The values are normalized to produce values between 0 and 1 (that is, 0 and 100 percent). As you can see from the resulting plot, the linear regression follows the data points closely. The logistic regression doesn’t. However, logistic regression often is the correct choice when the data points naturally follow the logistic curve, which happens far more often than you might think. You must use the technique that fits your data best, which means using linear regression in this case.

A pass/fail example

An essential point to remember is that logistic regression works best for probability and classification. Consider that points on an exam ultimately predict passing or failing the course. If you get a certain percentage of the answers correct, you pass, but you fail otherwise. The following code considers the same data used for the example above, but converts it to a pass/fail list. When a student gets at least 70 percent of the questions correct, success is assured.

y2 = [0 if x < 0.70 else 1 for x in y1]
plt.scatter(x1, y2, c='r')
lin_values = [0.242 + 0.0933*x for x in x1]
log_values = [exp(0.242 + .9033*x) / (1 + exp(0.242 + .9033*x)) for x in range(-4, 5)]
plt.plot(x1, lin_values, 'b-^')
plt.plot(x1, log_values, 'g-*')
plt.legend(['Linear', 'Logistic', 'Org Data'])
plt.show()

This is an example of how you can use list comprehensions in Python to obtain a required dataset or data transformation. The list comprehension for y2 starts with the continuous data in y1 and turns it into discrete data. Note that the example uses precisely the same equations as before. All that has changed is the manner in which you view the data, as you can see in the resulting plot. Because of the change in the data, linear regression is no longer the option to choose. Instead, you use logistic regression to fit the data. Take into account that this example really hasn’t done any sort of analysis to optimize the results. The logistic regression fits the data even better if you do so.
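In practice, you normally let a library fit the logistic curve for you rather than hand-picking the coefficients. The following sketch is an addition to the example above (it assumes scikit-learn and NumPy are installed, and the fitted coefficients will differ from the hand-picked values shown earlier); it fits the same pass/fail data:

import numpy as np
from sklearn.linear_model import LogisticRegression

x1 = np.arange(9).reshape(-1, 1)                       # exam results as a single feature
y1 = [0.25, 0.33, 0.41, 0.53, 0.59, 0.70, 0.78, 0.86, 0.98]
y2 = [0 if x < 0.70 else 1 for x in y1]                # same pass/fail conversion as above

model = LogisticRegression()
model.fit(x1, y2)

print(model.predict(x1))         # predicted pass/fail label for each student
print(model.predict_proba(x1))   # probability of failing and of passing for each student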

How Data is Collected and Why It Can Be Problematic

Article / Updated 02-18-2020

Because data is so valuable and users are sometimes averse to giving it up, vendors constantly find new ways to collect data. One such method comes down to spying. Microsoft, for example, was recently accused (yet again) of spying on Windows 10 users even when the user doesn’t want their data collected. Lest you think that Microsoft is solely interested in your computing concerns, think again. The data Microsoft admits to collecting (and there is likely more) is pretty amazing. Microsoft’s data gathering doesn’t stop with your Windows 10 actions; it also collects data with Cortana, the personal assistant. Mind you, Alexa is accused of doing the same thing. Google, likewise, does the same thing. So, one of the trends the vendors are using is spying, and it doesn’t stop with Microsoft, nor does it stop with the obvious spying sources. It might actually be possible to write an entire book on the ways in which people are spying on you, but that would make for a very paranoid book, and there are other new data collection trends to consider.

You may have noticed that you get more email from everyone about the services or products you were provided. Everyone wants you to provide free information about your experiences in one of these forms:

  • Close-ended surveys: A close-ended survey is one in which the questions have specific answers that you check mark. The advantage is greater consistency of feedback. The disadvantage is that you can’t learn anything beyond the predefined answers.
  • Open-ended surveys: An open-ended survey is one in which the questions rely on text boxes in which the user enters data manually. In some cases, this form of survey enables you to find new information, but at the cost of consistency, reliability, and cleanliness of the data.
  • One-on-one interviews: Someone calls you or approaches you at a place like the mall and talks to you. When the interviewer is well trained, you obtain consistent data and can also discover new information. However, the quality of this information comes at the cost of paying someone to obtain it.
  • Focus group: Three or more people meet with an interviewer to discuss a topic (including products). Because the interviewer acts as a moderator, the consistency, reliability, and cleanliness of the data remain high and the costs are lower. However, now the data suffers contamination from the interaction between members of the focus group.
  • Direct observation: No conversation occurs in this case; someone monitors the interactions of another party with a product or service and records the responses using a script. However, because you now rely on a third party to interpret someone else’s actions, you have a problem with contamination in the form of bias. In addition, if the subject of the observation is aware of being monitored, the interactions likely won’t reflect reality.

These are just a few of the methods that are seeing greater use in data collection today. They’re just the tip of the iceberg. The key takeaway here is that no perfect means exists for collecting some types of data, and all data collection methods require some sort of participative event.

How to Perform Pattern Matching in Python

Article / Updated 02-17-2020

Pattern matching in computers is as old as the computers themselves. In looking at various sources, you can find different starting points for pattern matching, such as editors. However, the fact is that you can’t really do much with a computer system without having some sort of pattern matching occur. For example, the mere act of stopping certain kinds of loops requires that a computer match a pattern between the existing state of a variable and the desired state. Likewise, user input requires that the application match the user’s input to a set of acceptable inputs.

Using pattern matching in data analysis

Developers recognize that function declarations also form a kind of pattern and that to call the function successfully, the caller must match the pattern. Sending the wrong number or types of variables as part of the function call causes the call to fail. Data structures also form a kind of pattern because the data must appear in a certain order and be of a specific type. Where you choose to set the beginning for pattern matching depends on how you interpret the act. Certainly, pattern matching isn’t the same as counting, as in a for loop in an application. However, someone could argue that testing for a condition in a while loop matches the definition of pattern matching to some extent. Many people look at editors as the first use of pattern matching because editors were the first kinds of applications to use pattern matching to perform a search, such as to locate a name in a document.

Searching is most definitely part of the act of analysis because you must find the data before you can do anything with it. The act of searching is just one aspect, however, of a broader application of pattern matching in analysis. The act of filtering data also requires pattern matching. A search is a singular approach to pattern matching in that the search succeeds the moment that the application locates a match. Filtering is a batch process that accepts all the matches in a document and discards anything that doesn’t match, enabling you to see all the matches without doing anything else. Filtering can also vary from searching in that searching generally employs static conditions, while filtering can employ some level of dynamic condition, such as locating the members of a set or finding a value within a given range.

Filtering is the basis for many of the analysis features in declarative languages, such as SQL, when you want to locate all the instances of a particular data structure (a record) in a large data store (the database). The level of filtering in SQL is much more dynamic than in mere filtering because you can now apply conditional sets and limited algorithms to the process of locating particular data elements. Regular expressions, although not the most advanced of modern pattern-matching techniques, offer a good view of how pattern matching works in modern applications. You can check for ranges and conditional situations, and you can even apply a certain level of dynamic control. Even so, the current master of pattern matching is the algorithm, which can be fully dynamic and incredibly responsive to particular conditions.

Working with pattern matching

Pattern matching in Python closely matches the functionality found in many other languages. Python provides robust pattern-matching capabilities using the regular expression (re) library, and the library’s documentation gives a good overview of these capabilities. The sections below detail Python functionality using a number of examples.
Performing simple Python matches

All the functionality you need for employing Python in basic RegEx tasks appears in the re library. The following code shows how to use this library:

import re
vowels = "[aeiou]"
print(re.search(vowels, "This is a test sentence.").group())

The search() function locates only the first match, so you see the letter i as output because it’s the first vowel to appear in the test sentence. You need the group() function call to output an actual value because search() returns a match object.

When you look at the Python documentation, you find quite a few functions devoted to working with regular expressions, some of them not entirely clear in their purpose. For example, you have a choice between performing a search or a match. A match works only at the beginning of a string. Consequently, this code:

print(re.match(vowels, "This is a test sentence."))

returns a value of None because none of the vowels appears at the beginning of the sentence. However, this code:

print(re.match("a", "abcde").group())

returns a value of a because the letter a appears at the beginning of the test string. Neither search nor match will locate all occurrences of the pattern in the target string. To locate all the matches, you use findall or finditer instead. For example, this code:

print(re.findall(vowels, "This is a test sentence."))

returns a list like this:

['i', 'i', 'a', 'e', 'e', 'e', 'e']

Because this is a list, you can manipulate it as you would any other list. Match objects are useful in other ways. For example, you can create a more complete search by using the start() and end() functions, as shown in the following code:

testSentence = "This is a test sentence."
m = re.search(vowels, testSentence)
while m:
    print(testSentence[m.start():m.end()])
    testSentence = testSentence[m.end():]
    m = re.search(vowels, testSentence)

This code keeps performing searches on the remainder of the sentence after each search until it no longer finds a match, printing i, i, a, e, e, e, and e on separate lines. Using the finditer() function would be easier, but this code points out that Python does provide everything needed to create relatively complex pattern-matching code.

Doing more than pattern matching

Python’s regular expression library makes it quite easy to perform a wide variety of tasks that don’t strictly fall into the category of pattern matching. One of the most commonly used is splitting strings. For example, you might use the following code to split a test string using a number of whitespace characters:

testString = "This is\ta test string.\nYippee!"
whiteSpace = "[\s]"
print(re.split(whiteSpace, testString))

The escaped character, \s, stands for all whitespace characters, which includes the set [ \t\n\r\f\v]. The split() function can split any content using any of the accepted regular expression characters, so it’s an extremely powerful data manipulation function. The output from this example looks like this:

['This', 'is', 'a', 'test', 'string.', 'Yippee!']

Performing substitutions using the sub() function is another forte of Python. Rather than perform common substitutions one at a time, you can perform them all simultaneously, as long as the replacement value is the same in all cases. Consider the following code:

testString = "Stan says hello to Margot from Estoria."
pattern = "Stan|hello|Margot|Estoria"
replace = "Unknown"
print(re.sub(pattern, replace, testString))

The output of this example is Unknown says Unknown to Unknown from Unknown.
You can create a pattern of any complexity and use a single replacement value to represent each match. This is handy when performing certain kinds of data manipulation for tasks such as dataset cleanup prior to analysis.
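To make the cleanup idea concrete, here is a hypothetical sketch (not from the book; the sample records and pattern are invented for illustration) that uses groups with re.sub() to normalize phone numbers recorded in several different styles into one consistent format:

import re

records = ["555-123-4567", "(555) 123 4567", "555.123.4567"]
pattern = r"\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})"

# \1, \2, and \3 refer to the three captured groups of digits.
cleaned = [re.sub(pattern, r"(\1) \2-\3", rec) for rec in records]
print(cleaned)   # ['(555) 123-4567', '(555) 123-4567', '(555) 123-4567']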

How Pattern Matching Works in Data Science

Article / Updated 02-17-2020

Pattern matching is extremely useful in data science. But how exactly does pattern matching work? Patterns consist of a set of qualities, properties, or tendencies that form a characteristic or consistent arrangement — a repetitive model. Humans are good at seeing strong patterns everywhere and in everything. In fact, we purposely place patterns in everyday things, such as wallpaper or fabric. However, computers are better than humans are at seeing weak or extremely complex patterns because computers have the memory capacity and processing speed to do so. The capability to see a pattern is pattern matching.

Pattern matching is an essential component in the usefulness of computer systems and has been from the outset, so learning the functionality is hardly something radical or new. Even so, understanding how computers find patterns is incredibly important in defining how this seemingly old technology plays such an important part in new applications such as AI, machine learning, deep learning, and data analysis of all sorts. The most useful patterns are those that we can share with others. To share a pattern with someone else, you must create a language to define it — an expression. Here, you also discover regular expressions, a particular kind of pattern language, and their use in performing tasks such as data analysis. The creation of a regular expression helps you describe to an application what sort of pattern it should find, and then the computer, with its faster processing power, can locate the precise data you need in a minimum amount of time. This basic information helps you understand more complex pattern matching of the sort that occurs within the realms of AI and advanced data analysis. Of course, working with patterns using pattern matching through expressions of various sorts works a little differently in the functional programming paradigm.

How to find patterns in data

When you look at the world around you, you see patterns of all sorts. The same holds true for data that you work with, even if you aren’t fully aware of seeing the pattern at all. For example, telephone numbers and social security numbers are examples of data that follows one sort of pattern — that of a positional pattern. A telephone number in the United States consists of an area code of three digits, an exchange of three digits (even though the exchange number is no longer held by a specific exchange), and an actual number within that exchange of four digits. The positions of these three entities are important to the formation of the telephone number, so you often see a telephone number pattern expressed as (999) 999-9999 (or some variant), where the value 9 is representative of a number. The other characters provide separation between the pattern elements to help humans see the pattern.

Other sorts of patterns exist in data, even if you don’t think of them as such. For example, the arrangement of letters from A to Z is a pattern. This statement may not seem like a revelation, but the use of this particular pattern occurs almost constantly in applications when the application presents data in ordered form to make it easier for humans to understand and interact with the data. Organizational patterns are essential to the proper functioning of applications today, yet humans take them for granted, for the most part. Another sort of pattern is the progression.
One of the easiest and most often applied patterns in this category is the exponential progression, expressed as N^x, where a number N is raised to the x power. For example, an exponential progression of 2 starting with 0 and ending with 4 would be: 1, 2, 4, 8, and 16. The language used to express a pattern of this sort is the algorithm, and you often use programming language features, such as recursion, to express it in code.

Some patterns are abstractions of real-world experiences. Consider color, for example. To express color in terms that a computer can understand requires the use of three or four three-digit variables, where the first three are always some value of red, blue, and green. The fourth entry can be an alpha value, which expresses opacity, or a gamma value, which expresses a correction used to define a particular color with the display capabilities of a device in mind. These abstract patterns help humans model the real world in the computer environment so that still other forms of pattern matching can occur (along with other tasks, such as image augmentation or color correction).

Transitional patterns help humans make sense of other data. For example, referencing all data to a known base value enables you to compare data from different sources, collected at different times and in different ways, using the same scale. Knowing how various entities collect the required data provides the means for determining which transition to apply to the data so that it can become useful as part of a data analysis.

Data can even have patterns when missing or damaged. The pattern of unusable data could signal a device malfunction, a lack of understanding of how the data collection process should occur, or even human behavioral tendencies. The point is that patterns occur in all sorts of places and in all sorts of ways, which is why having a computer recognize them can be important. Humans may see only part of the picture, but a properly trained computer can potentially see them all. So many kinds of patterns exist that documenting them all fully would easily take an entire book. Just keep in mind that you can train computers to recognize and react to data patterns automatically in such a manner that the data becomes useful to humans in various endeavors. The automation of data patterns is perhaps one of the most useful applications of computer technology today, yet very few people even know that the act is taking place. What they see instead is an organized list of product recommendations on their favorite site or a map containing instructions on how to get from one point to another — both of which require the recognition of various sorts of patterns and the transition of data to meet human needs.

What are regular expressions?

Regular expressions are special strings that describe a data pattern. The use of these special strings is so consistent across programming languages that knowing how to use regular expressions in one language makes it significantly easier to use them in all other languages that support regular expressions. As with all reasonably flexible and feature-complete syntaxes, regular expressions can become quite complex, which is why you’ll likely spend more than a little time working out the precise manner by which to represent a particular pattern to use in pattern matching. You use regular expressions to refer to the technique of performing pattern matching using specially formatted strings in applications.
However, the actual code class used to perform the technique appears as Regex, regex, or even RegEx, depending on the language you use. Some languages use a different term entirely, but they’re in the minority. Consequently, when referring to the code class rather than the technique, use Regex (or one of its other capitalizations). The following information constitutes a brief overview of regular expressions. You can find more details in Python’s documentation for the re module. This source of additional help can become quite dense and hard to follow, though, so you might also want to review Python’s regular expression tutorial for further insights.

Defining special characters using escapes in pattern matching

Character escapes usually define a special character of some sort, very often a control character. You escape a character using the backslash (\), which means that if you want to search for a backslash, you must use two backslashes in a row (\\). The character in question follows the escape. Consequently, \b signals that you want to look for a backspace character. Programming languages standardize these characters in several ways:

  • Control character: Provides access to control characters such as tab (\t), newline (\n), and carriage return (\r). Note that the \n character (which has a value of \u000A) is different from the \r character (which has a value of \u000D).
  • Numeric character: Defines a character based on numeric value. The common types include octal (\nnn), hexadecimal (\xnn), and Unicode (\unnnn). In each case, you replace the n with the numeric value of the character, such as \u0041 for a capital letter A in Unicode. Note that you must supply the correct number of digits and use 0s to fill out the code.
  • Escaped special character: Specifies that the regular expression compiler should view a special character, such as ( or [, as a literal character rather than as a special character. For example, \( would specify an opening parenthesis rather than the start of a subexpression.

Defining wildcard characters in pattern matching

A wildcard character can define a kind of character, but never a specific character. You use wildcard characters to specify any digit or any character at all. The following list tells you about the common wildcard characters and what each one matches. Your language may not support all these characters, or it may define characters in addition to those listed.

  • . : Any character (with the possible exception of the newline character or other control characters)
  • \w : Any word character
  • \W : Any nonword character
  • \s : Any whitespace character
  • \S : Any non-whitespace character
  • \d : Any decimal digit
  • \D : Any nondecimal digit

Working with anchors in pattern matching

Anchors define how to interact with a regular expression. For example, you may want to work with only the start or end of the target data. Each programming language appears to implement some special conditions with regard to anchors, but they all adhere to the basic syntax (when the language supports the anchor). The following list defines the commonly used anchors and what each one does:

  • ^ : Looks at the start of the string
  • $ : Looks at the end of the string
  • * : Matches zero or more occurrences of the specified character
  • + : Matches one or more occurrences of the specified character (the character must appear at least once)
  • ? : Matches zero or one occurrences of the specified character
  • {m} : Specifies m number of the preceding characters required for a match
  • {m,n} : Specifies the range from m to n, which is the number of the preceding characters required for a match
  • expression|expression : Performs an or search in which the regular expression compiler will locate either one expression or the other expression and count it as a match

You may find figuring out some of these anchors difficult. The idea of matching means to define a particular condition that meets a demand. For example, consider the pattern h.t, which would match hit and hot, but not hoot or heat, because the . wildcard matches just one character. If you instead wanted to match hoot and heat as well, you’d use h.*t, because the * anchor lets the preceding wildcard match any number of characters. Using the right anchor is essential to obtaining a desired result.

Delineating subexpressions using grouping constructs

A grouping construct tells the regular expression compiler to treat a series of characters as a group. For example, the grouping construct [a-z] tells the regular expression compiler to look for all lowercase characters between a and z. However, the grouping construct [az] (without the dash between a and z) tells the regular expression compiler to look for just the letters a and z, but nothing in between, and the grouping construct [^a-z] tells the regular expression compiler to look for everything but the lowercase letters a through z. The following list describes the commonly used grouping constructs. The italicized letters and words in this list are placeholders.

  • [x] : Look for a single character from the characters specified by x.
  • [x-y] : Search for a single character from the range of characters specified by x and y.
  • [^expression] : Locate any single character not found in the character expression.
  • (expression) : Define a regular expression group. For example, ab{3} would match the letter a and then three copies of the letter b, that is, abbb. However, (ab){3} would match three copies of the expression ab: ababab.
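A short sketch (using only the standard library re module; the sample strings are chosen here for illustration) shows how these wildcards, quantifiers, and grouping constructs behave in Python:

import re

# '.' matches exactly one character, so 'hoot' and 'heat' don't qualify.
print(re.findall(r"h.t", "hit hot hoot heat"))          # ['hit', 'hot']

# '.*' lets any number of characters appear between the h and the t.
print(re.fullmatch(r"h.*t", "hoot") is not None)        # True
print(re.fullmatch(r"h.*t", "heat") is not None)        # True

# A quantifier applies to the group in parentheses, not just one character.
print(re.fullmatch(r"ab{3}", "abbb") is not None)       # True: a followed by bbb
print(re.fullmatch(r"(ab){3}", "ababab") is not None)   # True: three copies of ab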

The Need for Reliable Sources in Data Science Applications

Article / Updated 02-17-2020

Data science seems like a terribly precise field, but the outcomes are only as reliable as your data. The word reliable seems so easy to define when it comes to data sources, yet so hard to implement. A data source is reliable when the results it produces are both expected and consistent. A reliable data source produces mundane data that contains no surprises; no one is shocked in the least by the outcome. On the other hand, depending on your perspective, it could actually be a good thing that most people aren’t yawning and then falling asleep when reviewing data. That’s because the surprises make the data source worth analyzing and reviewing. Consequently, data has an aspect of duality. You want reliable, mundane, fully anticipated data that simply confirms what you already know, but the unexpected is what makes collecting the data useful in the first place.

You can also define reliability by the number of failure points contained in any measured resource. Given two otherwise equivalent data sources, the one with more failure points is automatically less reliable. Given that general data analysis, AI, machine learning, and deep learning all require huge amounts of information, the methodology used automatically reduces the reliability of such data because you have more failure points to consider. Consequently, you must have data from highly reliable sources of the correct type.

Scientists grappled with impressive amounts of data for years before anyone coined the term big data. At that point, the Internet didn’t produce the vast sums of data that it does today. Remember that big data is not simply a fad created by software and hardware vendors but has a basis in many of the following fields:

  • Astronomy: Consider the data received from spacecraft on a mission (such as Voyager or Galileo) and all the data received from radio telescopes, which are specialized antennas used to receive radio waves from astronomical bodies. A common example is the Search for Extraterrestrial Intelligence (SETI) project, which looks for extraterrestrial signals by observing radio frequencies arriving from space. The amount of data received and the computer power used to analyze a portion of the sky for a single hour is impressive. If aliens are out there, it’s very hard to spot them. (The movie Contact explores what could happen should humans actually intercept a signal.)
  • Meteorology: Think about trying to predict weather for the near term given the large number of required measures, such as temperature, atmospheric pressure, humidity, winds, and precipitation at different times, locations, and altitudes. Weather forecasting is really one of the first problems in big data, and quite a relevant one. According to Weather Analytics, a company that provides climate data, more than 33 percent of the worldwide Gross Domestic Product (GDP) is determined by how weather conditions affect agriculture, fishing, tourism, and transportation, just to name a few. Dating back to the 1950s, the first supercomputers of the time were used to crunch as much data as possible because, in meteorology, the more data, the more accurate the forecast. That’s the reason everyone is amassing more storage and processing capacity, as the Korean Meteorological Association has done for weather forecasting and studying climate change.
  • Physics: Consider the large amounts of data produced by experiments using particle accelerators in an attempt to determine the structure of matter, space, and time. For example, the Large Hadron Collider, the largest particle accelerator ever created, produces 15PB (petabytes) of data every year as a result of particle collisions.
  • Genomics: Sequencing a single DNA strand, which means determining the precise order of the many combinations of the four bases — adenine, guanine, cytosine, and thymine — that constitute the structure of the associated molecule, requires quite a lot of data. For instance, a single chromosome, a structure containing the DNA in the cell, may require from 50MB to 300MB. A human being normally has 46 chromosomes, and the DNA data for just one person consumes an entire DVD. Just imagine the massive storage required to document the DNA data of a large number of people or to sequence other life forms on earth.
  • Oceanography: Gathers data from the many sensors placed in the oceans to measure statistics, such as temperature and currents, using hydrophones and other sensors. This data even includes sounds for acoustic monitoring for scientific purposes (discovering characteristics about fish, whales, and plankton) and military defense purposes (finding sneaky submarines from other countries). This old surveillance problem is becoming ever more complex and digital.
  • Satellites: Recording images from the entire globe and sending them back to earth to monitor the Earth’s surface and its atmosphere isn’t a new business (TIROS 1, the first satellite to send back images and data, dates back to 1960). Over the years, however, the world has launched more than 1,400 active satellites that provide earth observation. The amount of data arriving on earth is astonishing and serves both military (surveillance) and civilian purposes, such as tracking economic development, monitoring agriculture, and monitoring changes and risks. A single European Space Agency satellite, Sentinel-1A, generates 5PB of data during two years of operation.

All these data sources have one thing in common: Someone collects and stores the data as static information (once collected, the data doesn’t change). This means that if errors are found, correcting them with an overall increase in reliability is possible. The key takeaway here is that you likely deal with immense amounts of data from various sources that could have any number of errors. Finding these errors in such huge quantities is nearly impossible. Using the most reliable sources that you can will increase the overall quality of the original data, reducing the effect of individual data failure points. In other words, sources that provide consistent data are more valuable than sources that don’t.

The Basics of Deep Learning Framework Usage and Low-End Framework Options

Article / Updated 02-17-2020

A deep learning framework is an abstraction that provides generic functionality, which your application code modifies to serve its own purposes. Unlike a library that runs within your application, when you’re using a framework, your application runs within it. You can’t modify basic deep learning framework functionality, which means that you have a stable environment in which to work, but most frameworks offer some level of extensibility. Deep learning frameworks are generally specific to a particular need, such as the web frameworks used to create online applications.

When thinking about a deep learning framework, what you’re really considering is how the framework manages the frozen spots and the hot spots used by the application. In most cases, a deep learning framework provides frozen spots and hot spots in these areas:

  • Hardware access (such as using a GPU with ease)
  • Standard neural network layer access
  • Deep learning primitive access
  • Computational graph management
  • Model training
  • Model deployment
  • Model testing
  • Graph building and presentation
  • Inference (forward propagation)
  • Automatic differentiation (backpropagation)

A good deep learning framework also exhibits specific characteristics that you may not find in other framework types. These characteristics help create an environment in which the deep learning framework enables you to create intelligent applications that learn and process data quickly. Here are some of the characteristics to consider when looking at a deep learning framework:

  • Optimizes for performance rather than resource usage or some other consideration
  • Performs tasks using parallel operations to reduce the time spent creating a model and associated neural network
  • Computes gradients automatically
  • Makes coding easy, because many of the people using deep learning frameworks aren’t developers, but rather subject matter experts
  • Interacts well with standard libraries used for plotting, machine learning, and statistics

Deep learning frameworks address other issues, such as providing good community support for specific problem domains, and the focus on specific issues determines the viability of a particular framework for a particular purpose. As with many forms of software development aid, you need to choose the deep learning framework you use carefully.

Working with low-end deep learning frameworks

Low-end deep learning frameworks often come with a built-in trade-off. You must choose between cost and usage complexity, as well as the need to support large applications in challenging environments. The trade-offs you’re willing to endure will generally reflect what you can use to complete your project. With this caveat in mind, the following information discusses a number of low-end frameworks that are incredibly useful and work well with small to medium-size projects, but that come with trade-offs for you to consider as well.

Chainer

Chainer is a library written purely in Python that relies on the NumPy and CuPy libraries. Preferred Networks leads the development of this library, but IBM, Intel, Microsoft, and NVidia also play a role. The main point with this library is that it helps you use the CUDA capabilities of your GPU by adding only a few lines of code. In other words, this library gives you a simple way to greatly enhance the speed of your code when working with huge datasets.
Many deep learning libraries today, such as Theano and TensorFlow, use a static deep learning approach called define and run, in which you define the math operations and then perform training based on those operations. Unlike Theano and TensorFlow, Chainer uses a define-by-run approach, which relies on a dynamic deep learning approach in which the code defines math operations as the training occurs. Here are the two main advantages of this approach:

  • Intuitive and flexible approach: A define-by-run approach can rely on a language’s native capabilities rather than require you to create special operations to perform analysis.
  • Debugging: Because the define-by-run approach defines the operations during training, you can rely on the internal debugging features to locate the source of errors in a dataset or the application code.

TensorFlow 2.0 can also use a define-by-run approach through eager execution.

PyTorch

PyTorch is the successor to Torch, which was written in the Lua language. One of the core Torch libraries (the PyTorch autograd library) started as a fork of Chainer. Facebook initially developed PyTorch, but many other organizations use it today, including Twitter, Salesforce, and the University of Oxford. Here are the features that make PyTorch special:

  • Extremely user friendly
  • Efficient memory usage
  • Relatively fast
  • Commonly used for research

Some people like PyTorch because it’s easy to read like Keras, but the scientist doesn’t lose the ability to use complicated neural networks. In addition, PyTorch supports dynamic computational model graphing directly, which makes it more flexible than TensorFlow without the addition of TensorFlow Fold.

MXNet

The biggest reason to use MXNet is speed. It might be hard to figure out whether MXNet or CNTK is faster, but both products are quite fast and are often used as a contrast to the slowness that some people experience when working with TensorFlow. (Published benchmarks of deep learning code provide some details.) MXNet is an Apache product that supports a host of languages including Python, Julia, C++, R, and JavaScript. Numerous large organizations use it, including Microsoft, Intel, and Amazon Web Services. Here are the aspects that make MXNet special:

  • Features advanced GPU support
  • Can be run on any device
  • Provides a high-performance imperative API
  • Offers easy model serving
  • Provides high scalability

It may sound like the perfect product for your needs, but MXNet does come with at least one serious failing: It lacks the level of community support that TensorFlow offers. In addition, most researchers don’t look at MXNet favorably because it can become complex, and a researcher isn’t dealing with a stable model in most cases.

Microsoft Cognitive Toolkit/CNTK

Its speed is one of the reasons to use the Microsoft Cognitive Toolkit (CNTK). Microsoft uses CNTK for big datasets — really big ones. As a product, it supports the Python, C++, C#, and Java programming languages. Consequently, if you’re a researcher who relies on R, this isn’t the product for you. Microsoft has used this product in Skype, Xbox, and Cortana. This product’s special features are

  • Great performance
  • High scalability
  • Highly optimized components
  • Apache Spark support
  • Azure Cloud support

As with MXNet, CNTK has a distinct problem in its lack of adequate community support. In addition, it tends not to provide much in the way of third-party support, either, so if the package doesn’t contain the features you need, you might not get them at all.
Fully evaluate your needs before selecting your deep learning framework.
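To make the define-by-run idea discussed above concrete, here is a minimal sketch (assuming PyTorch is installed; the numbers are chosen only for illustration) in which the computational graph is built while ordinary Python code runs:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x      # the graph for this expression is recorded here, at run time
if y > 5:               # ordinary Python control flow participates in the model
    y = y * 2
y.backward()            # automatic differentiation walks the recorded graph
print(x.grad)           # tensor(14.), the derivative of 2*(x**2 + 3*x) at x = 2

Because the graph exists only as the code executes, you can inspect intermediate values with a normal debugger, which is the debugging advantage described for Chainer.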

Data Science and Recommender Systems

Article / Updated 02-12-2020

A recommender system can suggest items or actions of interest to a user, after having learned the user’s preferences over time. The technology, which is based on data and machine learning techniques (both supervised and unsupervised), has been on the Internet for about two decades. Today you can find recommender systems almost everywhere, and they’re likely to play an even larger role in the future under the guise of personal assistants, such as Siri (developed by Apple), Amazon Alexa, Google Home, or some other artificial-intelligence-based digital assistant. The drivers for users and companies to adopt recommender systems are different but complementary:

  • Users: Have a strong motivation to reduce the complexity of the modern world (regardless of whether the issue is finding the right product or a place to eat) and avoid information overload.
  • Companies: Need recommender systems to provide a practical way to communicate in a personalized way with their customers and successfully push sales.

Recommender systems actually started as a means to handle information overload. The Xerox Palo Alto Research Center built the first recommender in 1992. Named Tapestry, it handled the increasing number of emails received by center researchers. The idea of collaborative filtering was born by learning from users and leveraging similarities in preferences. The GroupLens project soon extended recommender systems to news selection and movie recommendations (the MovieLens project). When giant players in the e-commerce sector, such as Amazon, started adopting recommender systems, the idea went mainstream and spread widely in e-commerce. Netflix did the rest by promoting recommenders as a business tool and sponsoring a competition to improve its recommender system that involved various teams for quite a long time. The result is an innovative recommender technology that uses SVD and Restricted Boltzmann Machines (a kind of unsupervised neural network).

However, recommender systems aren’t limited to promoting products. Since 2002, a new kind of Internet service has made its appearance: social networks such as Friendster, Myspace, Facebook, and LinkedIn. These services promote exchanges between users and share information such as posts, pictures, and videos. In addition, these services help create links between people with similar interests. Search engines, such as Google, amassed user response information to offer more personalized services and to better understand how to match users’ desires when responding to their queries.

Recommender systems have become so pervasive in guiding people’s daily lives that experts now worry about the impact on our ability to make independent decisions and perceive the world in freedom. A recommender system can blind people to other options — other opportunities — in a condition called the filter bubble. By limiting choices, a recommender system can also have negative impacts, such as reducing innovation. You can read about this concern in articles at dorukkilitcioglu.com and technologyreview.com. One detailed study of the effect, entitled “Exploring the Filter Bubble: The Effect of Using Recommender Systems on Content Diversity,” appears in the ACM digital library. The history of recommender systems is one of machines striving to learn about our minds and hearts, to make our lives easier, and to promote the business of their creators.

The Need for Standardized Data Collection Techniques

Article / Updated 02-12-2020

The data you use for data science programming initiatives comes from a number of sources. The most common data source is from information entered by humans at some point. Even when a system collects shopping-site data automatically, humans initially enter the information. A human clicks various items, adds them to a shopping cart, specifies characteristics (such as size) and quantity, and then checks out. Later, after the sale, the human gives the shopping experience, product, and delivery method a rating and makes comments. In short, every shopping experience becomes a data-collection exercise as well.

You can’t sit by each shopper’s side and provide instructions on how to enter data consistently. Consequently, the data you receive is inconsistent and nearly unusable at times. By reviewing forms of successful online stores, however, you can see how to provide a virtual self to assist the shopper in making consistent entries. The forms you provide for entering information have a great deal to do with the data you collect. When a form contains fewer handwritten entries and more check boxes, it tends to provide a better experience for the customer and a more consistent data source for you.

Many data sources today rely on input gathered from human sources. Humans also provide manual input. You call or go into an office somewhere to make an appointment with a professional. A receptionist then gathers information from you that’s needed for the appointment. This manually collected data eventually ends up in a dataset somewhere for analysis purposes. By providing training on proper data entry techniques, you can improve the consistency of input that the receptionist provides. In addition, you’re unlikely to have just one receptionist providing input, so training can also help the entire group of receptionists provide consistent input despite individual differences in perspective. Some forms of regulated data entry of this sort have become so complex today that the people doing it actually require a formal education, such as medical data entry personnel. The point is that the industry, as a whole, is generally moving toward trained data entry people, so your organization should make use of this trend to improve the consistency of the data you receive.

Data is also collected from sensors, and these sensors can take almost any form. For example, many organizations base physical data collection, such as the number of people viewing an object in a window, on cellphone detection. Facial recognition software could potentially detect repeat customers. However, sensors can create datasets from almost anything. The weather service relies on datasets created by sensors that monitor environmental conditions such as rain, temperature, humidity, cloud cover, and so on. Robotic monitoring systems help correct small flaws in robotic operation by constantly analyzing data collected by monitoring sensors. A sensor, combined with a small AI application, could tell you when your dinner is cooked to perfection tonight. The sensor collects data, but the AI application uses rules to help define when the food is properly cooked.

Of the forms of data collection, the data provided by sensors is the easiest to make consistent. However, sensor data is often inconsistent because vendors keep adding functionality as a means of differentiation. The solution to this problem is better data standards so that vendors must adhere to certain specifics when creating data.
Standards efforts are ongoing, but it pays to make sure that the sensors you use to collect data all rely on the same standards so that you obtain consistent input.
