Enterprise AI For Dummies
Recently, Gartner analyst Nick Heudecker generated a firestorm of debate when he said that an earlier Gartner statistic, which put the failure rate of big data projects at 60 percent, was too conservative and that an 85 percent failure rate is more accurate. Either way, it’s a daunting statistic.

One way to avoid becoming a statistic is to approach your AI journey using an industry-proven model — the machine learning development life cycle. This figure shows the seven elements of the methodology. This methodology is based on the cross-industry standard process for data mining (CRISP-DM), a widely used open standard process model that describes common approaches used by data mining experts.

Figure: The machine learning development life cycle.

The table shows the questions that must be answered for each element.

The Machine Learning Development Life Cycle: Elements and Questions

  Element                          Question
  Define the task                  What problem or question do you want to address with data?
  Collect the data                 What data do you have that could answer your questions?
  Prepare the data                 What do you need to do to prepare the data for mining?
  Build the model                  How can you mimic or enhance the human’s knowledge or actions through technology?
  Test and evaluate the model      What new information do you know now?
  Deploy and integrate the model   What actions should you trigger with the new information? What needs human validation?
  Maintain the model               How has the data changed over time? Do the results reflect current reality?
The process of developing a machine learning model is highly iterative. Often, you will find yourself returning to previous steps before proceeding to a subsequent one. A machine learning project is not considered complete after the first version has been deployed. Instead, the feedback you collect after the initial version helps you shape new goals and improvements for the next iteration.

In light of this feedback-and-iterate practice, the model is more a life cycle than a process, largely because data drives the process, not a hunch, a policy, a committee, or some immutable principle. You start with a hypothesis or a burning question, such as “What do all our loyal customers have in common?” or its flip side, “What do all our cancellations have in common?” Then you gather the required data, train a model with historical data, run current data through it to answer the question, and act on the answer. The steering group provides input along the way, but the data reflects the actual, not the hypothetical.

This principle of data-driven discovery and action is an important part of the life cycle because it assures that the process is defensible and auditable. It keeps the project from going off the rails and down a rabbit hole.

Using the life cycle, you will always be able to answer questions such as how and why you created a particular model, how you will assess its accuracy and effectiveness, how you will use it in a production environment, and how it will evolve over time. You will also be able to identify model drift and determine whether changes to the model based on incoming data are pointing you toward new insights or diverting you toward undesired changes in scope.

Of the seven steps in the methodology, the first three take up the most time. You may recall that cleaning and organizing data takes up to 60 percent of a data scientist’s time. There’s a good reason for that: bad data can cost up to 25 percent of revenue.

However, all that time spent preparing the data will be wasted if you don’t really know what you want out of the data.

Define the task

What problem or question do you want to address with data? Follow these steps:
  1. Determine your business objectives.
  2. Assess the situation.
  3. Determine your data mining goals.
  4. Produce a project plan.

Some people think of AI as a magic machine where you pour data into the hopper, turn the crank, and brilliance comes out the other end. The reality is that a data science project is the process of actually building the machine, not turning the crank. And before you build a machine, you must have a very clear picture of what you want the machine to do.

Even though the process is data-driven, you don’t start with data. You start with questions. You may have a wealth of pristine data nicely tailored to databases, but if you don’t know what you’re trying to do, when you turn the crank, the stuff that comes out the other end might be interesting, but it won’t be actionable.

That’s why you start with questions. If you ask the right questions, you will know what kind of data you need. And if you get the right data, at the end you will get the answers — and likely more questions as well.

During the business understanding step, you establish a picture of what success will look like by determining the criteria for success. This step starts with a question. In the course of determining what you need to answer the question, you explore the terminology, assumptions, requirements, constraints, risks, contingencies, costs, and benefits related to the question and assemble an inventory of available resources.

For example, your initial question might be “What is causing an increase in customer churn?” This question could be expanded to ask “Can you pinpoint specific sources of friction in the customer journey that are leading to churn?”

Pursuing that question may lead you to brainstorming and research, such as documenting the touchpoints in the current customer journey, analyzing the revenue impact of churn, and listing suspected candidates for friction.

Collect the data

What data do you have that may be able to answer your questions? Follow these steps:
  1. Collect initial data.
  2. Describe the data.
  3. Explore the data.
  4. Verify data quality.
To get to where you’re going, first you must know where you are.

Remember that moment in The Princess Bride when Westley, Inigo, and Fezzik list their assets and liabilities before storming the castle and determine that they will need a wheelbarrow and that a holocaust cloak would come in handy? That was data understanding.

During the data understanding step, you establish the type of data you need, how you will acquire it, and how much data you need. You may source your data internally, from second parties such as solution partners, or from third-party providers.

For example, if you are considering a solution for predictive maintenance on a train, you might pull information from Internet of Things (IoT) sensors, weather patterns, and passenger travel patterns.

To make sure you have the data required to answer your questions, you must first ask questions. What data do you have now? Are you using all the data you have? Maybe you’re collecting lots of data, but you use only three out of ten fields.

This step takes time, but it is an essential exercise that will increase the likelihood that you can trust the results and that you aren’t misled by the outcomes.

Prepare the data

What do you need to do to prepare the data for mining? Follow these steps:
  1. Select the data.
  2. Clean the data.
  3. Construct the data.
  4. Integrate the data.
  5. Format the data.
Select the data: In this current data-rich environment, narrowing the options to identify the exact data you need can pose a challenge. Factors to consider are relevance and quality. In cases that might be sensitive to bias, you must pay close attention to seemingly unrelated fields that might serve as a proxy. In a classic example, a loan approval process excluded race from its model, but included ZIP code, which often correlates directly with race, so the process retained the same level of bias as before.
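
To illustrate the proxy problem, here is a minimal Python sketch using pandas and entirely made-up loan data; the field names and values are hypothetical, not from any real process.

  import pandas as pd

  # Hypothetical loan records: race is excluded from the model,
  # but ZIP code remains as a candidate field.
  df = pd.DataFrame({
      "zip_code": ["90001", "90001", "90001", "10801", "10801", "10801"],
      "race":     ["A", "A", "A", "B", "B", "B"],
      "approved": [0, 0, 1, 1, 1, 1],
  })

  # A cross-tabulation shows ZIP code partitioning the records along the
  # same lines as the excluded attribute, that is, acting as a proxy.
  print(pd.crosstab(df["zip_code"], df["race"]))

  # Approval rates by ZIP code mirror approval rates by race.
  print(df.groupby("zip_code")["approved"].mean())
  print(df.groupby("race")["approved"].mean())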

Clean the data: The available data for your project may have issues, such as missing or invalid values or inconsistent formatting. Cleaning the data involves establishing a uniform notation to express each value and setting default values or using a modeling technique to estimate suitable values for empty fields.
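
As a minimal sketch of those two chores in Python with pandas (the column names, values, and defaults are hypothetical):

  import numpy as np
  import pandas as pd

  # Hypothetical records with typical issues: inconsistent formatting
  # and missing values.
  df = pd.DataFrame({
      "state":   ["CA", "california", "NY", None],
      "revenue": [1200.0, np.nan, 950.0, 400.0],
  })

  # Establish a uniform notation to express each value.
  df["state"] = df["state"].str.upper().replace({"CALIFORNIA": "CA"})
  df["state"] = df["state"].fillna("UNKNOWN")

  # Set a default value (here, the median) for empty fields; a modeling
  # technique such as regression imputation could estimate values instead.
  df["revenue"] = df["revenue"].fillna(df["revenue"].median())

  print(df)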

Construct the data: In some cases, you might need a field that can be calculated or inferred from other fields in the data. For example, if you are doing analysis by sales region, detailed order records may not include the region, but that information can be derived from the address. You might even need to create new records to indicate the absence of an activity, such as creating a record with a value of zero to indicate the lack of sales for a product in a region.
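
A minimal pandas sketch of both ideas, with hypothetical order records and a made-up state-to-region mapping:

  import pandas as pd

  # Hypothetical order records: no region field, but one can be derived
  # from the address (here, the state).
  orders = pd.DataFrame({
      "order_id": [1, 2, 3],
      "state":    ["CA", "NY", "WA"],
      "amount":   [100.0, 250.0, 75.0],
  })

  # Construct a new field from an existing one.
  state_to_region = {"CA": "West", "WA": "West", "NY": "East"}
  orders["region"] = orders["state"].map(state_to_region)

  # Create records for the absence of activity: a zero indicates
  # no sales in a region for this period.
  sales_by_region = orders.groupby("region")["amount"].sum()
  sales_by_region = sales_by_region.reindex(
      ["West", "East", "South"], fill_value=0.0)
  print(sales_by_region)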

Integrate the data: You might encounter a situation where you need to combine information from different data sources that store the data in different ways. For example, suppose you are analyzing store sales by region; if you don’t have a table for store-level data, you need to aggregate the order information for each store from individual orders to create store-level data. Or you may need to merge data from multiple tables. For example, in the store sales by region analysis, you may combine regional information such as manager and sales team from one source with store information from another source into one table.
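
As a minimal sketch of that aggregate-then-merge pattern in pandas, with two hypothetical sources:

  import pandas as pd

  # Two hypothetical sources that store the data in different ways.
  orders = pd.DataFrame({
      "store_id": [1, 1, 2, 2, 2],
      "amount":   [10.0, 20.0, 5.0, 15.0, 25.0],
  })
  regions = pd.DataFrame({
      "store_id": [1, 2],
      "region":   ["West", "East"],
      "manager":  ["Kim", "Ortiz"],
  })

  # Aggregate individual orders into store-level data...
  store_sales = orders.groupby("store_id", as_index=False)["amount"].sum()

  # ...then merge the regional information into one table.
  store_sales = store_sales.merge(regions, on="store_id")
  print(store_sales)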

Format the data: The data you need might be trapped in an image, such as a presentation or graphic, in which case you would have to extract it through some method, such as optical character recognition, and then store the information as structured data.
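
A minimal sketch, assuming the open source pytesseract wrapper (which requires the Tesseract OCR engine to be installed) and a hypothetical file name:

  import pandas as pd
  import pytesseract
  from PIL import Image

  # Extract text trapped in an image, such as a scanned slide.
  text = pytesseract.image_to_string(Image.open("quarterly_sales_slide.png"))

  # Store the extracted lines as structured data for further parsing.
  lines = pd.DataFrame({"line": text.splitlines()})
  print(lines.head())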

Build the model

How can you mimic or enhance the human’s knowledge or actions through technology? Follow these steps:
  1. Select an algorithm and modeling techniques.
  2. Test the fit.
  3. Build the model.
  4. Assess the model.
This step represents the primary role of the data scientist. Based on the data and the likely best fit, the data scientist selects the most promising algorithm, often from an open source library such as MLlib. The data scientist then uses tools available in popular programming languages such as R or Python to build a usable ML model from the algorithm and the data. The process can take some time, depending on peculiarities in the data or the nuances of your business. In the end, after training the algorithm on the sanitized historical data, you get actionable information, such as a prediction or a next best action.
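
As a minimal Python sketch of this step using scikit-learn (one common choice; MLlib or R would work equally well), with synthetic data standing in for your sanitized historical data:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import train_test_split

  # Synthetic stand-in for sanitized historical data.
  X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

  # Hold out a test set for the evaluation described below.
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, random_state=42)

  # Select a promising algorithm and train it on the historical data.
  model = RandomForestClassifier(random_state=42)
  model.fit(X_train, y_train)

  # The trained model yields actionable information, such as a prediction.
  print(model.predict(X_test[:5]))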

By now, the modeling technique to use should be an obvious choice based on the questions you developed at the beginning and the data you have to work with.

After you have trained the model using the source data set, test its accuracy with the test data set. One way of evaluating test results for a classification model is to use a confusion matrix, a simple quadrant-style table that compares predicted classifications with actual outcomes.

For a simple example, consider a binary classifier that produces a yes or no answer. There are two ways of getting it right (to correctly predict yes or no) and two ways of getting it wrong (to incorrectly predict yes or no). In this case, imagine a recommendation engine offering a yes or no prediction for a specific customer regarding 100 items compared to the customer’s actual responses. This table shows a set of possible results.
Example of Binary Classifier Results (100 iterations)

                            AI (Predicted)
                            No    Yes
  Customer (Actual)   No    35    10
                      Yes    5    50
A result can be true or false and positive or negative, giving four possibilities as shown in the next table.
Example Results Categories

  Prediction   Actual   Category         Percent
  Yes          Yes      True positive    50
  No           No       True negative    35
  Yes          No       False positive   10
  No           Yes      False negative    5
In this case, the model has an accuracy rate of 0.85: it got 85 of its 100 predictions right (50 true positives plus 35 true negatives). That number may be good or bad, depending on your requirements. You may move forward, or you may refine the model and try again.
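
You can reproduce these numbers with a few lines of Python; this sketch reconstructs 100 actual/predicted pairs from the example counts and checks them with scikit-learn:

  import numpy as np
  from sklearn.metrics import accuracy_score, confusion_matrix

  # 100 actual/predicted pairs matching the counts in the tables above:
  # 35 true negatives, 10 false positives, 5 false negatives, 50 true positives.
  actual    = np.array([0] * 45 + [1] * 55)
  predicted = np.array([0] * 35 + [1] * 10 + [0] * 5 + [1] * 50)

  # Rows are actual (no, yes); columns are predicted (no, yes).
  print(confusion_matrix(actual, predicted))  # [[35 10] [ 5 50]]

  # Accuracy = (true positives + true negatives) / total = (50 + 35) / 100
  print(accuracy_score(actual, predicted))    # 0.85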

Test and evaluate the model

What new information do you know now? Follow these steps:
  1. Evaluate the results.
  2. Review the process.
  3. Determine the next steps.
In this step, you go back to the beginning and compare your goals with the results to determine whether the model provides enough of the right kind of insight to answer your questions.

Deploy and integrate the model

What actions should you trigger with the new information? What needs human validation? Follow these steps:
  1. Plan the deployment.
  2. Plan monitoring and maintenance.
  3. Produce the final report and presentation.
  4. Review the project.
After you have an acceptable model, it’s time to roll out the information using the plan developed during the business understanding stage so your teams can execute on the insight the project has produced. Look for game-changing insights that will alter how you do business. You might change workflows, introduce automation at some points, and establish touchpoints for human intervention. You might introduce confidence scoring, with automated actions for outcomes above or below a window and human review for the middle ground.
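
A minimal sketch of such a confidence-scoring gate in Python; the thresholds and action names are purely illustrative:

  def route_prediction(confidence, low=0.2, high=0.8):
      """Decide what to do with a model prediction based on its confidence."""
      if confidence >= high:
          return "automate: positive action"   # clear-cut positive outcome
      if confidence <= low:
          return "automate: negative action"   # clear-cut negative outcome
      return "human review"                    # middle ground needs validation

  for score in (0.95, 0.50, 0.10):
      print(score, route_prediction(score))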

Maintain the model

Because data has a shelf life, no data science project can run under a set-and-forget philosophy. In the maintenance stage, you regularly retrain your model on fresh data so its answers reflect the new reality of now.
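
As a minimal sketch of that policy, assuming a scikit-learn-style model and an illustrative accuracy floor (both the threshold and the names are hypothetical):

  import numpy as np
  from sklearn.metrics import accuracy_score

  ACCURACY_FLOOR = 0.80  # hypothetical minimum acceptable accuracy

  def maybe_retrain(model, X_fresh, y_fresh, X_history, y_history):
      """Retrain when the model's answers stop reflecting current reality."""
      current = accuracy_score(y_fresh, model.predict(X_fresh))
      if current < ACCURACY_FLOOR:
          # Fold the fresh data into the training set and refit.
          model.fit(np.concatenate([X_history, X_fresh]),
                    np.concatenate([y_history, y_fresh]))
      return model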

The final report can be as simple as a summary of the life of the project and the outcomes, or it can be an exhaustive analysis of the results, their implications, and your plans for implementing the insights.

It’s always a good idea to hold a lessons-learned session after any significant effort, particularly if you plan to keep using the life cycle. This meeting can cover the rabbit trails you followed and insights into best practices.

About the book author

Zachary Jarvinen, MBA/MSc, is a product and marketing executive and a sought-after author and speaker in the enterprise AI space. Over the course of his career, he has headed up Technology Strategy for Artificial Intelligence and Analytics at OpenText, expanded markets for Epson, worked at the U.S. State Department, and was a member of the 2008 Obama Campaign Digital Team. Presently, Zachary is focused on helping organizations get tangible benefits from AI.
