Connecting Predictive Analytics to Related Disciplines

Data Science Essentials For Dummies

Predictive analytics makes heavy use of three related disciplines: data mining, statistics, and machine learning. All four disciplines intersect to such a large degree that their names are often used interchangeably. Just to keep the record straight, there are some distinctions: predictive analytics combines many of the techniques, tools, and algorithms that data mining, statistics, and machine learning have in common.

Its goal, however, is to use those tools to understand the data, analyze it, learn from it, and create a mathematical model that can make useful business predictions. That forward-looking orientation is what differentiates predictive analytics, even as it combines aspects of the other three disciplines in the various steps or stages required to develop a predictive model.

It's the predictive capability that sets apart this specialized use of statistics, data mining, and machine learning. Let’s examine the contributions of each discipline.

Statistics

Statistics are numbers that result from measuring empirical data over time. As a discipline, statistics concerns itself with gathering and analyzing data — often to tell a story in numbers.

Statistics can infer relationships within data; thus it plays an active role in presenting data and helping people understand it. Using only a sample of the data, statistics can give you a basis for inferring hypotheses about the data as a whole. Thus the discipline is both descriptive and inferential.

Inferential statistics is concerned with making inferences or predictions about the characteristics of a whole population from a sample dataset. Sample data for analysis is chosen at random, in a way that represents (but doesn't include) the whole population, which is usually too large or too costly to access in full. When the sample dataset is well chosen, analyzing it allows the investigator to project the findings of the analysis onto the whole population with a measurable degree of accuracy.
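To make the sampling idea concrete, here is a minimal Python sketch (not taken from the book). It draws a random sample from a synthetic population and uses a normal approximation to attach a margin of error to the sample mean. The population values, the seed, and the function name estimate_mean are all invented for illustration.

```python
import random
import statistics
from statistics import NormalDist


def estimate_mean(sample, confidence=0.95):
    """Return (mean, margin) for a normal-approximation confidence interval."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / n ** 0.5
    # z-value that leaves (1 - confidence) split evenly across both tails.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, z * std_err


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 synthetic order values.
population = [rng.gauss(50.0, 12.0) for _ in range(100_000)]

# The analyst only gets to look at 400 randomly chosen values.
sample = rng.sample(population, 400)

mean, margin = estimate_mean(sample)
population_mean = statistics.fmean(population)

print(f"sample estimate: {mean:.2f} +/- {margin:.2f}")
print(f"population mean: {population_mean:.2f}")
```

The margin shrinks as the sample grows (roughly with the square root of the sample size), which is one way to read the "measurable degree of accuracy" mentioned above.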

Statistics relies, of course, on mathematics to analyze the data by verifying a hypothesis or theory, determining its accuracy, and drawing a conclusion.

It is the statistician’s job to test, using a sample, whether a hypothesis about the population is supported by the data or should be rejected. A sample can provide strong evidence for or against a hypothesis, but not absolute proof. The fact that a hypothesis precedes the analysis is a clear distinction for statistics, and a hallmark that differentiates it from other techniques.
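The hypothesis-first workflow can be sketched in a few lines of Python. This is an illustrative example under stated assumptions, not the book's method: the delivery-time scenario, the data, and the function one_sample_z_test are hypothetical, and the test uses a normal approximation that is only reasonable for fairly large samples.

```python
import random
import statistics
from statistics import NormalDist


def one_sample_z_test(sample, hypothesized_mean):
    """Return (z, two-sided p-value) using a normal approximation."""
    n = len(sample)
    std_err = statistics.stdev(sample) / n ** 0.5
    z = (statistics.fmean(sample) - hypothesized_mean) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# The hypothesis comes first: "average delivery time is 30 minutes."
HYPOTHESIZED_MEAN = 30.0

# Hypothetical sample of 100 delivery times, generated with a true mean of 33.
rng = random.Random(7)
delivery_times = [rng.gauss(33.0, 5.0) for _ in range(100)]

z, p_value = one_sample_z_test(delivery_times, HYPOTHESIZED_MEAN)
print(f"z = {z:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The sample is hard to reconcile with the hypothesis.")
else:
    print("The sample is consistent with the hypothesis.")
```

A small p-value counts as evidence against the stated hypothesis; it measures how surprising the sample would be if the hypothesis were true.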

Unlike data mining, statistics traditionally places less emphasis on data cleansing and preprocessing. But descriptive statistics and data mining do have data grouping in common; both aim to describe data and define the processes used to categorize it.

A common denominator among all these disciplines is the underlying mathematics. Math is at the heart of statistics and of all the algorithms and programming used in data mining, machine learning, and predictive analytics. Another common denominator is data analysis, a quest for better-informed, smarter decisions about future outcomes.

Data mining

Data mining is concerned mainly with analyzing data through describing, summarizing, classifying, and clustering the data so as to find useful patterns, links, or associations within it.

Data mining is often used interchangeably with machine learning, but there are some distinctions between the two terms. For example, data miners are familiar with the use of machine learning to perform some tasks involving large amounts of data. However, they can also create a sophisticated and optimized query on a database without the use of machine learning, and that would still be considered data mining. This is similar to knowledge discovery in databases (KDD), which finds knowledge in data and emphasizes a broad application of particular data mining techniques.
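The point about queries can be illustrated with a small sketch that uses Python's built-in sqlite3 module. The purchases table and its rows are made up for this example. The query counts which pairs of products appear together in the same order, so it surfaces an association with plain SQL and no machine learning.

```python
import sqlite3

# In-memory database holding a hypothetical purchase log.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (order_id INTEGER, product TEXT)")
rows = [
    (1, "bread"), (1, "butter"), (1, "jam"),
    (2, "bread"), (2, "butter"),
    (3, "bread"), (3, "jam"),
    (4, "butter"), (4, "milk"),
    (5, "bread"), (5, "butter"), (5, "milk"),
]
conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)

# Self-join on order_id to pair up products bought in the same order.
# The a.product < b.product condition counts each pair only once.
pair_counts = conn.execute(
    """
    SELECT a.product, b.product, COUNT(*) AS orders_together
    FROM purchases AS a
    JOIN purchases AS b
      ON a.order_id = b.order_id AND a.product < b.product
    GROUP BY a.product, b.product
    HAVING COUNT(*) >= 2
    ORDER BY orders_together DESC, a.product, b.product
    """
).fetchall()

for first, second, count in pair_counts:
    print(f"{first} + {second}: {count} orders")
```

On this toy data the query reports bread and butter as the most frequent pair. The threshold of two orders in the HAVING clause is an arbitrary choice for the example.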

Machine learning

Machine learning is another discipline that focuses on analyzing the data and making sense of it — but does so by teaching a computer program to do the work. The more data is analyzed, the more the system can learn about it, what to expect from it, and what to do with it. Data analysis becomes a source of continuous feedback to the system, which promptly analyzes data, attempts to understand and make sense of it, and gets better at those tasks.

When a new special case, usually an exception or a new behavior, is processed for the first time, the knowledge base of the system is incremented to include the new case. The more special cases are handled, the better equipped the system is to make decisions that produce the most useful outcome. This is the nature of the learning that the machine is programmed to do.
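One simple way to picture this kind of incremental learning is an online perceptron, a classic algorithm chosen here purely for illustration (the text does not name a specific method). In the sketch below, the model changes only when a new case contradicts what it currently predicts. The class name, the data stream, and the labeling rule are all hypothetical.

```python
class OnlinePerceptron:
    """Linear classifier that updates itself one example at a time."""

    def __init__(self, n_features, learning_rate=1.0):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate
        self.mistakes = 0  # how many cases forced an update

    def predict(self, features):
        score = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1 if score > 0 else -1

    def learn(self, features, label):
        """Process one case; adjust only when the current model gets it wrong."""
        if self.predict(features) != label:
            self.mistakes += 1
            step = self.learning_rate * label
            self.weights = [w + step * x for w, x in zip(self.weights, features)]
            self.bias += step


# Hypothetical stream of (features, label) cases.
# The label is +1 when the first feature exceeds the second, else -1.
stream = [
    ([2.0, 1.0], 1),
    ([1.0, 3.0], -1),
    ([4.0, 2.0], 1),
    ([0.5, 2.5], -1),
    ([3.0, 0.5], 1),
    ([1.0, 2.0], -1),
]

model = OnlinePerceptron(n_features=2)
for _ in range(10):  # replay the stream a few times
    for features, label in stream:
        model.learn(features, label)

print("updates triggered by surprising cases:", model.mistakes)
print("prediction for an unseen case [5.0, 1.0]:", model.predict([5.0, 1.0]))
```

Every update happens on a case the model got wrong. Once the model handles all the cases in the stream correctly, replaying them triggers no further changes.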

Machine learning is the equivalent of teaching a system the rules of a game, and then getting the system to practice the game by playing at elementary and intermediate levels. After that preparation, the system can play at advanced levels in real time.

IBM Watson uses natural language processing and machine learning to reveal insights from large amounts of unstructured data. Watson competed in a 2011 Jeopardy! match and won. Before that, in 1997, IBM’s Deep Blue beat world chess champion Garry Kasparov. In 2016, Google’s AlphaGo beat the world champion, Lee Sedol, in Go (a much harder game to conquer, because Go is exponentially more complex than chess).

Machine learning is particularly well suited to:

Complex data
Data in various forms
Data collected from diverse sources
Data in large amounts

Data mining can uncover previously unknown connections and associations in the data. Machine learning can categorize the new and upcoming unknowns, learn from them based on its previous processing of the data, and get better at incorporating them into the known data. Both techniques lead to greater insight and understanding of the data.