What Data Miners Do
If you think of data as raw material, and the information you can get from data as something valuable and relatively refined, the process of extracting information can be compared to extracting metal from ore or gems from dirt. That’s how the term data mining originated.
Focusing on the business of data mining
Data miners don’t just ponder data aimlessly, hoping to find something interesting. Every data-mining project begins with a specific business problem and a goal to match.
As a data miner, you probably won’t have the authority to make final business decisions, so it’s important that you align your work with the needs of decision makers. You must understand their problems, needs, and preferences, and focus your efforts on providing information that supports good business decisions.
Your own business knowledge is very important. Executives are not going to sit next to you while you work, providing feedback on the relevance of your discoveries to their concerns. You must use your own experience and acumen to judge that for yourself as you work.
Understanding how data miners spend their time
It would be great if data miners could spend all day making life-changing discoveries, building valuable models, and integrating them into everyday business. But that’s like saying it would be great if athletes could spend all day winning tournaments. It takes a lot of preparation to build up to those moments of triumph. So, like athletes, data miners spend a lot of time on preparation.
Getting to know the data-mining process
A good work process helps you make the most of your time, your data, and all your other resources. In this book, you’ll discover the most popular data-mining process, CRISP-DM. It’s a six-phase cycle of discovery and action created by a consortium of data miners from many industries, and an open standard that anyone may use.
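The phases of the CRISP-DM process are
Business understanding (defining the business problem and goals)
Data understanding (getting familiar with the available data)
Data preparation (cleaning the data and getting it ready to use)
Modeling (building models from the data)
Evaluation (checking that the models meet the business goals)
Deployment (using models in everyday business)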
Each phase is equally important to the quality of the results and their value to the business. But in terms of the time required, data preparation dominates. Data preparation routinely takes more time than all other phases of the data-mining process combined.
When the goals are understood, and the data is cleaned up and ready to use, you can turn your attention to building predictive models. Models do what reports cannot; they give you information that supports action.
A report can tell you that sales are down. It can break sales down by region, product, and channel so that you know where sales declined and whether these declines were widespread or affected only certain areas. But it doesn’t give you any clues about why sales declined or what actions might help to revive the business.
Models help you understand the factors that impact sales, the actions that tend to increase or decrease sales, and the strategies and tactics that keep your business running smoothly. That’s exciting, isn’t it? Maybe that’s why most data miners consider modeling to be the fun part of the job.
Understanding mathematical models
Mathematical models are central to data mining, but what are they? What do they do, how do they work, and how are they created?
A mathematical model is, plain and simple, an equation, or set of equations, that describes a relationship between two or more things. Such equations are shorthand for theories about the workings of nature and society. The theory may be supported by a substantial body of evidence or it may be just a wild guess. The language of mathematics is the same in either case.
Terms such as predictive model, statistical model, or linear model refer to specific types of mathematical models, the names reflecting the intended use, the form, or the method of deriving a particular model. These three examples are just a few of many such terms.
When a model is mentioned in a business setting, it’s most likely a model used to make predictions. Models are used to predict stock prices, product sales, and unemployment rates, among many other things.
These predictions may or may not be accurate, but for any given set of input values (known factors, called independent variables or inputs), the model produces a well-defined prediction (also called a dependent variable, output, or result). Mathematical models are used for other purposes in business as well, such as describing the mechanisms that drive a particular process.
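To make the idea of inputs and a well-defined prediction concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the function name, the two inputs (advertising spend and price), and the coefficients are invented and don't come from any real data.

```python
def predict_weekly_sales(ad_spend, price):
    """Hypothetical linear model: predicted units sold in one week.

    The coefficients are invented for illustration only.
    """
    baseline = 500.0      # starting point of the equation (the intercept)
    ad_effect = 0.25      # extra units sold per dollar of advertising
    price_effect = -20.0  # units lost per dollar of price
    return baseline + ad_effect * ad_spend + price_effect * price


# The inputs (independent variables) go in; the prediction (output) comes out.
prediction = predict_weekly_sales(ad_spend=1000.0, price=10.0)
print(prediction)  # prints 550.0
```

The prediction may turn out to be wrong, but it is well defined: the same inputs always give the same output.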
In data mining, you create models by finding patterns in data using machine learning or statistical methods. Data miners don’t follow the same rigorous approach that classical statisticians do, but all models are derived from actual data and consistent mathematical modeling techniques. All data-mining models are supported by a body of evidence.
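As a rough illustration of deriving a model from data, the sketch below fits a straight line to a handful of observations using ordinary least squares, a classical statistical method. It is only one of many possible techniques, and data miners often use more elaborate ones. The function name `fit_line` and the data values are made up for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: advertising spend (thousands of dollars) and units sold.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [12.1, 13.8, 16.2, 17.9, 20.0]

slope, intercept = fit_line(ad_spend, units_sold)

# The fitted equation is the model: units_sold = intercept + slope * ad_spend
print(f"units_sold = {intercept:.2f} + {slope:.2f} * ad_spend")
```

The pattern in the data (sales rise by roughly two units per thousand dollars spent) ends up encoded in the slope, which is what it means for a model to be derived from actual data.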
Why use mathematical models? Couldn’t the same relationships be described using words? That’s possible, yet you find certain advantages to the use of equations. These include
Convenience: Compared with equivalent descriptions written out in sentences, equations are brief. Mathematical symbolism has evolved specifically for the purpose of representing mathematical relationships; languages such as English have not.
Clarity: Equations convey ideas succinctly and are unambiguous. They’re not subject to differing interpretations based on culture, and the symbolism of mathematics is a sort of common language used widely across the globe.
Consistency: Because mathematical representations are unambiguous, the implications of any particular situation are clearly defined by a mathematical model.
Putting information into action
A model only delivers value when you use it in the business. A model’s predictions might support decision making in a variety of ways. You might
Incorporate predictions into a report or presentation to be used in making a specific decision.
Integrate the model into an operational system (such as a customer service system) to provide real-time predictions for everyday use. (For example, you might flag insurance claims for immediate payment, immediate denial, or further investigation.)
Use the model for batch predictions. (For example, you could score the in-house customer list to decide which customers should receive a particular offer.)
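The batch-prediction case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scoring rule in `score_customer`, the customer fields, and the cutoff value are all hypothetical stand-ins for a real model and a real customer list.

```python
def score_customer(customer):
    """Hypothetical model: a made-up score for how likely a customer is
    to respond to an offer (higher means more likely)."""
    score = 0.1
    if customer["purchases_last_year"] >= 3:
        score += 0.4
    if customer["opened_last_email"]:
        score += 0.3
    return score


# Made-up in-house customer list.
customers = [
    {"id": "C001", "purchases_last_year": 5, "opened_last_email": True},
    {"id": "C002", "purchases_last_year": 1, "opened_last_email": False},
    {"id": "C003", "purchases_last_year": 4, "opened_last_email": False},
]

CUTOFF = 0.4  # assumed threshold; in practice this is a business decision

# Score every customer in one pass and keep those at or above the cutoff.
selected = [c["id"] for c in customers if score_customer(c) >= CUTOFF]
print(selected)  # prints ['C001', 'C003']
```

The same scoring function could instead be called one record at a time from an operational system to provide real-time predictions.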