Published:
September 29, 2014

Data Mining For Dummies

Overview

Delve into your data for the key to success

Data mining is quickly becoming integral to creating value and business momentum. The ability to detect unseen patterns hidden in the numbers exhaustively generated by day-to-day operations allows savvy decision-makers to exploit every tool at their disposal in the pursuit of better business. By creating models and testing whether patterns hold up, it is possible to discover new intelligence that could change your business's entire paradigm for a more successful outcome.

Data Mining for Dummies shows you why it doesn't take a data scientist to gain this advantage, and empowers average business people to start shaping a process relevant to their business's needs. In this book, you'll learn the hows and whys of mining to the depths of your data, and how to make the case for heavier investment into data mining capabilities. The book explains the details of the knowledge discovery process including:

  • Model creation, validity testing, and interpretation
  • Effective communication of findings
  • Available tools, both paid and open-source
  • Data selection, transformation, and evaluation

Data Mining for Dummies takes you step-by-step through a real-world data-mining project using open-source tools that allow you to get immediate hands-on experience working with large amounts of data. You'll gain the confidence you need to start making data mining practices a routine part of your successful business. If you're serious about doing everything you can to push your company to the top, Data Mining for Dummies is your ticket to effective data mining.


About The Author

Meta S. Brown helps organizations use practical data analysis to solve everyday business problems. A hands-on data miner who has tackled projects with up to $900 million at stake, she is a recognized expert in cutting-edge business analytics.

Sample Chapters


CHEAT SHEET

Data mining is the way that ordinary businesspeople use a range of data analysis techniques to uncover useful information from data and put that information into practical use. Data miners don’t fuss over theory and assumptions. They validate their discoveries by testing. And they understand that things change, so when the discovery that worked like a charm yesterday doesn’t hold up today, they adapt.


Articles from the book

Data mining is done by trial and error, and so, for data miners, making mistakes is only natural. Mistakes can be valuable, in other words, at least under certain conditions. Not all mistakes are created equal, however. Some are just better avoided. The following list offers ten such mistakes. If you read through them carefully, and commit them to memory, you just might avoid a few bumps on the learning curve: Skipping data quality checks: Most data miners think developing predictive models is more fun than reviewing data for quality problems.
Fresh information about data mining is made available to you every single day through specialty analytics sites, blogs, professional organizations, and even in the news. Deepen your data-mining knowledge with these resources. Society of Data Miners. Professional organizations help members to advance their knowledge and careers.
You don’t have to be an expert in every data mining technique, of course, but a little knowledge about other tools and approaches can prepare you well for new challenges. This list introduces you to ten such approaches. Business analysis. Business analysis is the study of business systems and processes with the aim of improving them.
Your own internal data is often the most relevant data you can get. Government and nonprofit sources offer valuable data free. Use these sources whenever you can! When those sources don’t meet your needs, you’ll have to turn to commercial data suppliers. But which suppliers? Acxiom. Acxiom is a major source for consumer marketing data.
Following is a list of key analytics software sources. Each of these plays an important role in the data and analysis landscape, and each offers tools with some analytics capability. All of them are known to some extent as data-mining tools. Alteryx. About the developer: Alteryx Inc. is a privately held corporation based in the United States.
Data miners work fast. One way to improve your productivity is to take full advantage of tools that let you do several things at once. It’s time-consuming (and boring) to set up a number of graphs separately, one at a time. So use these alternatives whenever you can: Data summaries: Tools that let you quickly ask for summaries of many variables, and get the summaries all at once.
Because data miners lean heavily on basic graphs, some data-mining applications offer little or nothing more. Others provide a wide range of graph options, from the common to the exotic. It’s not necessary to use all of these, but you may benefit by selecting and using a few that suit your own needs. Data miners often use these graphs: Boxplot (also called box and whiskers): Histograms describe distributions of continuous variables, but have limited value for showing details.
Every profession has its guiding principles, ideas that provide structure and guidance in everyday work. Data mining is no exception. Following are nine fundamental ideas to guide you as you get down to work and become a data miner. These are the 9 Laws of Data Mining as they were originally stated by the pioneering data miner, Thomas Khabaza.
A data miner is a businessperson who has a feel for numbers, not a programmer, database manager, or statistician. Data mining enables businesspeople to rapidly discover useful patterns in data, build models, and put them into action in everyday business. To do data mining, you need tools to fit the job, tools designed for users like you.
Rushing your choice of data-mining software comes with risks. Your data-mining projects begin with plans. Until you have defined what you intend to accomplish and your own working requirements, don’t focus on software. Good preparation protects you from these common risks: Inadequate software capabilities: New data miners often find that the software they’ve selected doesn’t have the full set of capabilities they require.
Data miners often take advantage of special features to pack more information into simple charts. Labels, overlays, and interactive selection are hallmarks of data-mining applications, special features that allow you to be more productive. Mileage decreases as horsepower increases, as seen in the following figure.
You can learn more about using commercially available data for business and consumer marketing by connecting with marketers and market researchers who share your interests, as well as data vendors. These professional associations are a good starting point for making contacts: American Marketing Association, Direct Marketing Association, and Advertising Research Foundation. Although this list represents only a small portion of the hundreds of data suppliers active in today’s market, even these few provide a wide range of offerings, covering millions of individuals.
To introduce you to the kinds of consumer information available through commercial suppliers, look at a detailed example. The table includes all the data collected about one consumer by Acxiom, a major vendor of consumer marketing data. This vendor provides marketing data about individual consumers and the households in which those consumers live, as follows: Individual consumers: For each individual, the vendor divides information into two data categories: Characteristic: Demographics such as age, marital status, level of education attained, and whether the consumer has children.
Data miners often sort cases (change the order of rows) to get clearer organization for viewing data or export. Or, you may have a functional reason to sort. For example, some applications require sorting data before merging (joining columns from different data sources). The steps for sorting vary a lot from one application to another.
Summarizing data, finding totals, and calculating averages and other descriptive measures are probably not new to you. When you need your summaries in the form of new data, rather than reports, the process is called aggregation. Aggregated data can become the basis for additional calculations, merged with other datasets, used in any way that other data is used.
Not all the data you may need is about people. Perhaps you’re more interested in businesses or nonprofit organizations. Maybe you have an interest in thunderstorms, pineapples, or bridges. No problem. Commercial sources can provide data for all these things, and many more. If data is available that you value enough to consider paying for it, somebody probably is out there ready to sell.
Perhaps you have shopped at one of the warehouse clubs, retail chain stores that offer members-only shopping in large, no-frills stores. Warehouse clubs have bare concrete floors, plain functional shelving, and limited choices of products and package sizes. Their check-out lanes don’t offer bags, let alone baggers, to pack up your purchases.
Online environments present data miners with a unique mix of challenges and advantages for data collection and analysis. Here’s the bad news: Web data formats can be difficult to import and manipulate in data-mining applications. Systems that serve web pages are often poorly integrated with sales tracking systems, making it hard to identify connections between the visitor’s experience and the resulting actions.
The United States is only one of many governments that share data with the public. While you won’t find exactly the same range or types of data from every country, you will find that most nations have some data to share. There are also some intergovernmental and nonprofit organizations that offer international data resources.
Finding the data you need from state and local governments can be very challenging. Some states are more interested in sharing data than others. You can’t count on every state or local government to have an open data portal, or on finding someone in the local government to help you find what you need or address your questions.
The U.S. government includes over 100 statistical agencies, agencies with a primary purpose of collecting and analyzing data for some government use. The result is a vast resource of professionally collected, managed, and analyzed data, much of which is available to you. Bureau of Economic Analysis. The Bureau of Economic Analysis (BEA) is a part of the United States Department of Commerce.
Data collected by large organizations in the course of everyday business is usually stored in databases. But database administrators may not be willing to allow data miners direct access to these data sources, and direct access may not be the best option from your point of view either. Direct access to operational (used for routine business operations) databases can be a bad idea because: Data miners use a lot of data.
Perhaps the most common application for experiments in data mining, legitimate controlled experiments much like the ones that scientists use, is direct marketing. Direct marketing involves contacting individual people. When you get a text or an email from a retailer, that’s direct marketing. Traditional mail order catalogs, phone calls from charities, and campaign letters from political candidates are all forms of direct marketing.
A data-mining project begins when you identify a specific business issue to investigate. The narrower and better-defined the question, the more effectively it can be answered. The more clearly the question is defined, the more clearly the data requirements can be understood, as well as the limitations of the answer.
Humans use experience when they interpret the data they see, but computers can’t. Your data-mining software will do its best to identify the kind of data in each column, but data types are often ambiguous. When you see a list of ZIP Codes, you don’t try to add and subtract them. You know that they represent places.
Surveys may be the most common and familiar approach for obtaining your own unique data from people. Anybody can write a few questions, present them to some people, and there you have it . . . a survey. Good surveys, though, require thought and effort. In survey research, people are asked to answer questions, usually about themselves.
As a data miner, you want data-mining tools, time to devote to a worthwhile data-mining project, or maybe just the opportunity to do something new and different from the usual routine. In your business case, you’re not setting out to make anyone and everyone desire data mining. You’re setting out to convince a specific group of people that their pain is too much to live with, that your plan can make that pain go away, and that you can be trusted to do that.
The order of variables (columns) in a dataset is usually just a matter of how they were arranged in the source file or the database query that was used to import them. That arrangement may not be convenient for you. If you have many variables, it may be hard to spot the ones you want to see. Or perhaps some order makes sense to you, and you’d like the variables arranged that way.
Your first hands-on step with data is getting it from wherever it is to the place where you need it to be. Text formats are common, and you’re likely to encounter them often. One of the most common is comma-separated value (.csv) text. KNIME.com AG is a small software and services firm focused on data mining. It offers a data-mining product with a visual programming interface.
The Bioinformatics Laboratory of the Faculty of Computer and Information Science, University of Ljubljana, Slovenia, develops Orange in cooperation with an open source community. To open the sample data in Orange, follow these steps: Start Orange Canvas. University of Ljubljana does not offer support agreements.
RapidMiner is a small software and services firm focused on data mining. It offers a data-mining product with a visual programming interface. To open the sample data in RapidMiner, follow these steps: Start RapidMiner Studio. RapidMiner is an offshoot of the YALE development project of the Dortmund University of Technology (Germany).
University of Waikato faculty members develop tools as part of their work toward advancement of the field of machine learning. These tools are used in teaching, by scientists, and in industry. Weka is its general-purpose data-mining tool that offers a visual programming interface and a wide range of analytics capabilities.
The first step toward predictive modeling is relating variables to one another. A simple, remarkable tool for that is the scatterplot. It’s used to relate one continuous measure to another. Data miners sometimes stretch the rules and use it with categorical variables as well. The horizontal (x) axis of the plot represents values of one variable; the vertical axis (y) represents a second variable.
Surveys are useful for collecting data about almost any aspect of human life. You can only ignore surveys if your profession has nothing to do with people, like say, astrophysics. Then again, astrophysicists need people to fund their research and want people to visit planetariums, so they might need surveys, too!
If you have a loyalty program and the data it produces, what are you supposed to do with it? As a data miner, it’s your role to provide decision makers with analysis that supports the business. Some executives understand loyalty programs and may request specific information, perhaps more of it than you have hours to provide.
A basic part of the data-understanding phase of the data-mining process is investigating variables one at a time, reviewing their distributions, and checking for obvious data quality issues. Bar charts and histograms are visual summaries that make it easy and quick to understand variable distributions. The two chart types are very similar.

General Data Science

Using codes for data reduces data entry time, prevents errors, and reduces the memory requirements for storing the data. But the codes aren’t meaningful unless you have documentation, or labels, to explain their meaning. Some data formats enable you to enjoy the advantages of using codes while keeping the information about the meaning of the codes in the same file.
A loyalty program is an agreement between a business and its customers. Customers agree to allow the business to track purchases (and possibly other actions as well), and in return, the business offers rewards. Typical rewards include lower prices or a free product or service. You may be involved in several loyalty programs as a customer right now.
It’s not just your own interests that can cause a project’s scope to expand. As you work, you’ll have discussions with coworkers, and they’ll all have ideas and questions to inspire more exploration. Asking questions and exploring data can be fun. Now that you are a data miner, you’ll find that you can ask and answer questions that were previously beyond your reach.
An awful lot of data miners rely exclusively on a little bag of data mining tricks they learned years ago and don’t regularly invest time in adding new skills to the mix. The reasoning is usually simple and understandable. They are getting the job done, but they’re busy. They don’t have time for exploring new things, especially new things they aren't sure they need.
How did Tom Khabaza come to lay down the laws of data mining? There’s something to be said for being first on the scene. Khabaza started data mining in the early 1990s, when few people had even heard of data mining, let alone tried it. He began his career in psychology and gravitated to the study of cognition, human learning.
When your data is in more than one place, you need ways to put it all together. When you join two datasets with different variables, you’re merging data. Merging is a common operation in data mining, used to combine linked data such as customer records and marketing campaign data, before and after test results, and internal and vendor data. To merge datasets, you must have a variable that identifies cases for matching; this is called a key or identifier variable.
Most political campaigns depend on consultants to provide voter research, or else get by with very informal assessments of voter attitudes and interest in voting for a particular candidate (or voting at all). But in recent years, certain political campaigns, including both candidate and issue campaigns, have begun to use microtargeting, organized programs of survey research and message testing, to develop and deliver personalized campaign messaging tailored to individual voters.
Data mining has very strict requirements for data organization. They are not exotic, complex, or difficult requirements to meet, but they are strict. The figure shows a sample of data viewed as a table in data-mining software. Each row represents one parcel of real estate. Information about the parcels of real estate is organized in columns.
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is the dominant process framework for data mining. In the first phase of a data-mining project, before you approach data or tools, you define what you’re out to accomplish and define the reasons for wanting to achieve this goal. The business understanding phase includes four tasks (primary activities, each of which may involve several smaller parts).
In the second phase of the Cross-Industry Standard Process for Data Mining (CRISP-DM) process model, you obtain data and verify that it is appropriate for your needs. You might identify issues that cause you to return to business understanding and revise your plan. You may even discover flaws in your business understanding, another reason to rethink goals and plans.
Data miners spend most of their time on the third phase of the Cross-Industry Standard Process for Data Mining (CRISP-DM) process model: data preparation. Most data used for data mining was originally collected and preserved for other purposes and needs some refinement before it is ready to use for modeling. The data preparation phase includes five tasks.
Modeling is the part of the Cross-Industry Standard Process for Data Mining (CRISP-DM) process model that most data miners like best. Your data is already in good shape, and now you can search for useful patterns in your data. The modeling phase includes four tasks: selecting modeling techniques, designing test(s), building model(s), and assessing model(s). Task: Selecting modeling techniques. The wonderful world of data mining offers oodles of modeling techniques, but not all of them will suit your needs.
In the first four phases of the Cross-Industry Standard Process for Data Mining (CRISP-DM) process model, you’ve explored data and you’ve found patterns, and now you have to ask: Are the results any good? You’ll evaluate not just the models you create but also the process that you used to create them, and their potential for practical use.
Deployment is where data mining pays off. In this final phase of the Cross-Industry Standard Process for Data Mining (CRISP-DM) process, it doesn’t matter how brilliant your discoveries may be, or how perfectly your models fit the data, if you don’t actually use those things to improve the way that you do business.
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is the dominant data-mining process framework. It's an open standard; anyone may use it. The following list describes the various phases of the process. Business understanding: Get a clear understanding of the problem you're out to solve, how it impacts your organization, and your goals for addressing it.
Data privacy is a big issue for data miners. News reports outlining the level of personal data in the hands of the US government's National Security Agency and breaches of commercial data sources have raised public awareness and concern. A central concept in data privacy is personally identifiable information (PII), or any data that can be traced to the individual person it describes.
Now that you are a data miner, you’re also a primary researcher. Sounds more scientific, doesn’t it? Your research is primary because you will begin from raw (basic, unprocessed) data and analyze it to add something new to the world’s knowledge. You’ll probably also integrate some secondary research into your work.
Before you begin searching for data to mine on data.gov, the federal data portal, you must understand one thing: There is no data on the site. Data.gov is home to a data catalog, a list of dataset names with details such as descriptions, formats, and URLs for obtaining data and additional information. The data itself is hosted and shared by the individual government agencies that create it, and each agency does things in its own way.
When you are data mining, sometimes you’ll have more data than you need for a given project. Here’s how to pare down to just what you need. Narrowing the fields. When you have many variables in a dataset, it can be hard to find or see the ones that interest you. And if your datasets are large, and you don’t need all the variables, keeping the extras soaks up resources unnecessarily.
You may need to use data that’s in a spreadsheet, XML (extensible markup language), or any of dozens of less common formats. The key question will always be: Does your data-mining application import data in that format? As long as your data-mining application has a tool to read the data format you need, the process will be straightforward — just a small variation depending on your data source.
You’re not getting into data mining just for the fun of playing with numbers. You want action. You want to see things done right, and you understand that it’s important to base business decisions on solid evidence from data. But you’re not the one with the power to make the decisions. So you’ll need to exercise your influence with the people who have authority.
Pioneering data miner Thomas Khabaza developed his "Nine Laws of Data Mining" to guide new data miners as they get down to work. This reference guide shows you what each of these laws means to your everyday work. 1st Law of Data Mining, or "Business Goals Law": Business objectives are the origin of every data mining solution.
If you’re looking for data that the federal government might have, but you aren’t sure which agency is involved, start your search on the federal data portal. There you will find a searchable catalog of data from all federal agencies. You can search for datasets by keywords and get information about what’s available, the source for each dataset, the formats available, and where to find the data.
Despite the many desirable aspects of survey research, you also find limitations. It’s difficult to get good data when the subjects are people, no matter how you go about it. Even scientific researchers, who make every effort to conduct controlled studies, cannot control experimental conditions with human subjects as they do with lab animals.
As a data miner, your place in the organizational chart may be in a special group devoted to analytics, or within any conventional business unit. No matter where you’re placed, whether you’re dabbling in data mining or making a full-time job of it, you will be most productive if you are familiar with the roles of other business units and on good terms with appropriate staff members.
A data miner’s discoveries have value only if a decision maker is willing to act on them. As a data miner, your impact will be only as great as your ability to persuade someone — a client, an executive, a government bureaucrat — of the truth and relevance of the information you have to share. This means you’ve got to learn to tell a good story — not just any story, but one that honestly conveys the facts and their implications in a way that is compelling for your decision maker.
A data miner has nothing without data. And if you work in a large organization, you’ll have hundreds, perhaps thousands, of existing data resources potentially available for data mining. Every activity generates records, and those records can become your raw material. The table shows the variety of commonly collected data in a number of business activities.
Data miners work fast. To get speed, you’ll need to use appropriate tools and discover the tricks of the trade. Your best data-mining tool is your brain, with a bit of know-how. The second-best tool is a data-mining application with a visual programming interface. With visual programming, the steps in your work process are represented by small images that you organize on the screen to create a picture of the flow and logic of your work.
If you think of data as raw material, and the information you can get from data as something valuable and relatively refined, the process of extracting information can be compared to extracting metal from ore or gems from dirt. That’s how the term data mining originated. Focusing on the business of data mining. Data miners don’t just ponder data aimlessly, hoping to find something interesting.
If you’ve read a few news reports about data mining, you may have gotten the impression that it’s more complex than brain surgery. It isn’t. You may have heard that data miners can learn things about you that you don’t even know yourself. That’s unlikely. You may have heard that you need a Ph.D. and reams of data to get started in data mining, and that’s ridiculous.
Data mining has costs — costs for software, costs for labor, costs for servers, and perhaps costs to obtain data as well. To justify paying for all of this, you may be required to prepare a business case. A business case outlines a specific business problem, a proposed plan to address it, and the associated benefits and costs.