Data Mining - dummies

By Thomas C. Hammergren

The distinguishing characteristic of data mining, compared with querying, reporting, or even OLAP, is that you can get information without having to ask specific questions.

Data mining serves two primary roles in your business intelligence mission:

  • The “Tell me what might happen” role: The first role of data mining is predictive, in which you basically say, “Tell me what might happen.” The data mining tool uses hidden knowledge locked away in your data warehouse to ferret out the likelihood of future trends and occurrences and present it to you.

  • The “Tell me something interesting” role: In addition to possible future events and occurrences, data mining also tries to pull out interesting information that you probably should know about, such as a particularly unusual relationship between sales of two different products and how that relationship varies according to placement in your retail stores.

    Although many of these interesting tidbits are likely to exist, what questions would you ask if you were using a querying or OLAP tool, and how would you interpret the results? Data mining assists you in this arduous task of figuring out what questions to ask by doing much of the grunt work for you.

Data mining in specific business missions

Data mining is particularly suited for these specific types of business missions:

  • Detecting fraud

  • Determining marketing program effectiveness

  • Selecting whom, from a large customer base or the general population, you should target as part of a marketing program

  • Managing customer life cycle, including the customer retention mission

  • Performing advanced business process modeling and what-if scenarios

Think about what’s behind each of the business missions in the preceding list:

  • A large amount of data

  • An even larger number of combinations of various pieces of data

  • Intensive result-set analysis, usually involving complex algorithms and advanced statistical techniques

Now, think about what you would have to do if you were using a reporting or OLAP tool to accomplish these missions. You’d find it virtually impossible to thoroughly perform any of the preceding missions if you had to ask a question and get a result, ask another question and get another result, and then keep repeating those steps.
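To get a feel for how quickly those combinations pile up, here is a minimal Python sketch. The figure of 50 attributes is a made-up assumption for illustration, and the function name is hypothetical rather than part of any data mining tool.

```python
from math import comb


def count_attribute_combinations(num_attributes: int, max_size: int) -> int:
    """Count the distinct groups of 1..max_size attributes you could examine together."""
    return sum(comb(num_attributes, size) for size in range(1, max_size + 1))


# Assumption: a hypothetical warehouse table with 50 attributes.
pairs_only = comb(50, 2)                               # 1225 possible pairs
up_to_three = count_attribute_combinations(50, 3)      # 50 + 1225 + 19600 = 20875

print(f"Pairs of attributes: {pairs_only}")
print(f"Groups of up to three attributes: {up_to_three}")
```

Even when you look at no more than three attributes at a time, that is more than 20,000 separate questions to ask one by one, before you consider the individual values each attribute can take.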

Data mining and artificial intelligence

If you’ve been in the information technology (IT) field for at least a decade, some of the preceding terms might sound vaguely familiar. Unlocking hidden knowledge? Predictive functionality? Wait a minute — that’s artificial intelligence!

From the earliest days of commercial computing, there has been a tremendous interest in developing “thinking machines” that can process large amounts of data and make decisions based on that analysis.

Interest in artificial intelligence (AI) hit its zenith in the mid-1980s. At that time, database vendors worked on producing knowledge base management systems (KBMSs); other vendors came out with expert system shells, or AI-based application development frameworks that used techniques such as forward-chaining and backward-chaining to advise users about decisions; and neural networks were positioned as the next big AI development.

Interest in AI waned in the early 1990s, when expectations exceeded available capabilities and other frenzies, such as client/server migration and (of course) data warehousing, took center stage.

Now, AI is back!

The highest-profile AI technique used in data mining is the neural network. Neural nets were originally envisioned as a processing model that would mimic the way the human brain solves problems, using neurons and highly parallel processing to recognize patterns.

Applying neural network algorithms to the areas of business intelligence that data mining handles (again, predictive and “tell me something interesting” missions) seems to be a natural match.
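As a rough illustration of the idea, the following Python sketch trains a single artificial neuron (a perceptron), which is far simpler than the networks a real data mining tool would use. The two yes/no customer flags and the "retained" labels are invented for this example, and the function names are hypothetical.

```python
def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Train one artificial neuron with the classic perceptron update rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            prediction = 1 if activation > 0 else 0
            error = label - prediction  # 0 when the neuron is already correct
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


def predict(weights, bias, features):
    """Fire (1) if the weighted sum of the inputs exceeds zero, else stay quiet (0)."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0


# Assumption: invented toy data. Each row is [flag_a, flag_b] for one customer,
# and the label is 1 only when both flags are set.
samples = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 0, 0, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, row) for row in samples]
print(predictions)  # [0, 0, 0, 1]
```

The neuron is never told the rule; it adjusts its weights each time it guesses wrong until its predictions match the examples. This toy works only because the pattern is simple enough for one neuron to separate.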

Although the data mining/neural network game is definitely worth checking into, you should do it carefully. You can find a lot of interesting and exciting technologies that, in the hands of those who don’t understand the algorithms, will likely fail.

However, with proper knowledge and education, you can make a full-scale commitment to bringing this type of processing into your business intelligence framework, pairing its technical analysis with the business analysis that OLAP provides.

Data mining and statistics

The more mature area of data mining is the application of advanced statistical techniques against the large volumes of data in your data warehouse. Different tools use different types of statistical techniques, tailored to the particular areas they’re trying to address.

Without a statistical background, you might find much of data mining confusing. You need to do a lot of work to train the algorithms and build the rules to ensure proper results with larger datasets. However, assuming that you’re comfortable with this concept, or have a colleague who can assist, here are some of the more widely leveraged algorithms:

  • Classification algorithms: Predict one or more discrete variables, based on the other attributes in the dataset. By using classification algorithms, the data mining tool can look at large amounts of data and then inform you that, for example, “Customers who are retained through at least two generations of product purchases tend to have these characteristics: They have an income of at least $75,000, and they own their own homes.”

  • Regression algorithms: Predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset. Regression algorithms are often driven by historical information presented to the data mining tool “over time,” better known as time series information.

  • Segmentation algorithms: Divide data into groups, or clusters, of items that have similar properties.

  • Association algorithms: Find correlations between different attributes in a dataset. The most common application of this kind of algorithm creates association rules, which you can use in a market basket analysis. For example, an association rule might tell you that a customer who purchases a particular software package has a 65-percent chance of purchasing at least two product-specific add-on packs within two weeks.

  • Sequence analysis algorithms: Summarize frequent sequences or episodes in data, such as a web-path flow.
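To make the association-rule idea a little more concrete, here is a minimal Python sketch that computes the two figures usually quoted for a rule: support (how often the items appear together) and confidence (how often the second item appears when the first one does). The baskets and the function name are invented for illustration; a real tool would search for such rules across the whole warehouse rather than check one rule that you hand it.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [b for b in baskets if antecedent <= set(b)]
    with_both = [b for b in with_antecedent if consequent <= set(b)]
    support = len(with_both) / len(baskets)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Assumption: five made-up purchase baskets.
baskets = [
    {"software", "addon_pack"},
    {"software", "addon_pack", "manual"},
    {"software"},
    {"manual"},
    {"software", "addon_pack"},
]

support, confidence = rule_metrics(baskets, {"software"}, {"addon_pack"})
print(f"support={support:.2f}, confidence={confidence:.2f}")  # support=0.60, confidence=0.75
```

In this toy data, three of the four software buyers also bought an add-on pack, so the rule "software, therefore add-on pack" has a confidence of 75 percent, the same kind of figure as the 65-percent chance in the example above.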

Many more methods exist. Dust off that old statistics book and start reading.