Data Science Strategy For Dummies Cheat Sheet
A revolutionary change is taking place in society and it involves data science. Everybody, from small local companies to global enterprises, is starting to realize the potential of data science and is seeing the value in digitizing their data assets and becoming data driven. Regardless of industry, companies have embarked on a similar journey to explore how to drive new business value by utilizing analytics, machine learning (ML), and artificial intelligence (AI) techniques and introducing data science as a new discipline.
However, although utilizing these new technologies will help companies simplify their operations and drive down costs, nothing is simple about getting the strategic approach right for your data science investment. This cheat sheet gives you a peek at the fundamental concepts you need to be on top of when building your data science strategy. It looks not only at investing in a top-performing data science team, but also at what to consider in your data architecture and how to approach the commercial aspects of data science.
Data Science Strategy: Machine Learning Basics
People often ask me to explain the difference between advanced analytics and machine learning and to say when it is advisable to go for one approach or the other. It’s always good to start out by defining machine learning. Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to perform the task.
Advanced analytics and machine learning have some characteristics in common:
- Both advanced analytics and machine learning techniques are used for building and executing advanced mathematical and statistical models as well as building optimized models that can be used to predict events before they happen.
- Both methods use data to develop the models, and both require defined model policies.
- Automation can be used to run both analytics models and machine learning models after they’re put into production.
What about the differences between advanced analytics and machine learning?
- There is a difference in who the actor is when creating your model. In an advanced analytics model, the actor is human; in a machine learning model, the actor is (obviously) a machine.
- There is also a difference in the model format. Analytics models are developed and deployed with a human-defined design, whereas machine learning models are dynamic: they change design and approach as they're being trained by the data, optimizing the design along the way. Machine learning models can also be deployed as dynamic, which means that they continue to train, learn, and optimize the design when exposed to real-life data and its live context.
- Another difference lies in how data is used: analytics models are tested with data, whereas machine learning models are trained with data. In analytics, data is used to test that the defined outcome is achieved as expected, while in machine learning, the data is used to train the model to optimize its design depending on the nature of the data.
- Finally, the techniques and tools used to develop advanced analytics models and machine learning models differ. Machine learning modeling techniques are much more advanced and are built on different principles, which center on how the machine learns to optimize the model's performance.
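The contrast between a human-defined model and a trained model can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a production approach: the churn scenario, the sample data, and the names `analytics_rule`, `train_threshold`, and `count_errors` are all hypothetical, and the "learning" here is simply a search for the cutoff that misclassifies the fewest training examples.

```python
# Hypothetical training data: (support tickets filed last month, customer churned?)
TRAINING_DATA = [(0, False), (1, False), (2, False), (3, False),
                 (4, True), (5, True), (7, True), (9, True)]


def analytics_rule(tickets):
    """Analytics-style model: a person chose the cutoff of 6 up front."""
    return tickets >= 6


def train_threshold(examples):
    """ML-style model: pick the cutoff that makes the fewest errors on the data."""
    candidates = sorted({tickets for tickets, _ in examples})
    best_cutoff, best_errors = None, None
    for cutoff in candidates:
        errors = sum((tickets >= cutoff) != churned for tickets, churned in examples)
        if best_errors is None or errors < best_errors:
            best_cutoff, best_errors = cutoff, errors
    return best_cutoff


def count_errors(predict, examples):
    """Count the examples that a prediction function gets wrong."""
    return sum(predict(tickets) != churned for tickets, churned in examples)


# The cutoff comes from the training data rather than from a person.
learned_cutoff = train_threshold(TRAINING_DATA)


def learned_rule(tickets):
    """Apply the cutoff that was learned from the training data."""
    return tickets >= learned_cutoff


print("human-defined rule errors:", count_errors(analytics_rule, TRAINING_DATA))
print("learned cutoff:", learned_cutoff)
print("learned rule errors:", count_errors(learned_rule, TRAINING_DATA))
```

In the analytics-style function, the data can only be used to test whether the fixed rule behaves as expected. In the trained version, calling `train_threshold` again on newer examples could yield a different cutoff, which loosely mirrors the dynamic models described in the list above.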
What Does It Mean to Be a Data-Driven Organization?
Data is the new black! Or the new oil! Or the new gold! Whatever you compare data to, it’s probably true from a conceptual value perspective. As a society, we have now entered a new era of data and intelligent machines. And data science isn’t a passing trend or something that you can or should avoid. Instead, you should embrace it and ask yourself whether you understand enough about it to leverage it in your business. Be open-minded and curious! Dare to ask yourself whether you truly understand what being data-driven means.
If you start by putting the ongoing changes happening in society into a wider context, it’s a common understanding that we humans are now experiencing a fourth industrial revolution, driven by access to data and advanced technology. It’s also referred to as the digital revolution. But beware! Digitizing or digitalizing your business isn’t the same as being data-driven.
Digitization is a widely used concept that basically refers to transitioning from analog to digital, like the conversion of data to a digital format. In relation to that, digitalization refers to making the digitized information work in your business.
The concept of digitalizing a business is sometimes mixed up with being data-driven. However, it’s vital to remember that digitalizing the data isn’t just a good thing to do — it’s the foundation for enabling a data-driven enterprise. Without digitalization, you simply cannot become data-driven.
In a data-driven organization, the starting point is data. It's truly the foundation of everything. But what does that actually mean? Well, being data-driven means that you need to be ready to take data seriously. In practice, that means you start from the data: you analyze it to understand what type of business you should be doing. You must take the outcome of the analysis seriously enough to be prepared to change your business models accordingly. You must be ready to trust and use the data to drive your business forward. Data should be your main concern in the company. You need to become “data-obsessed.”
Defining and Scoping a Data Science Strategy
There’s a difference between a data science strategy and a data strategy. At a high level, a data science strategy refers to the strategy you define with regard to the entire data science investment in your company. A data science strategy includes areas such as overall data science objectives and strategic choices, regulatory strategies, data needs, competences and skillsets, data architecture, as well as how to measure the outcome.
The data strategy, on the other hand, constitutes a subset of the data science strategy and is focused on outlining the strategic direction directly related to the data. This includes areas such as data scope, data consent, legal, regulatory, and ethical considerations, data collection frequency, data storage retention periods, data management process and principles, and, last but not least, data governance.
Both strategies are needed to succeed with your data science investment, and they should complement each other.
If you ask about the objectives of a data science strategy, you’re asking whether there are clear company objectives set and agreed on for any of the investments made in data science. Are the objectives of your data science strategy formulated in a way that makes them possible to execute and measure success by? If not, then the objectives need to be reformulated; this is a critically important starting point that must be completed properly in order to succeed down the line.
Data science is a new field that holds amazing opportunities for companies to drive a fundamental transformation, but it is complex and often not fully understood by top management. You should consider whether the executive team’s understanding of data science is sufficient to set the right targets or whether they need to be educated and then guided in setting those targets.
Whether you’re a manager or an employee in a small or large company, if you want your company to succeed with its data science investment, don’t sit and hope that the leadership of your company will understand what needs to be done. If you’re knowledgeable in the area, make your voice heard or, if you aren’t, don’t hesitate to accept help from those who have experience in the field.
If you decide to bring in external experts to assist you in your data science strategizing, be sure to read up on the area yourself first, so that you can judge the relevance of their recommendations for your business — the place where you are the expert.
The Basics of Data for Data Science
The terms data and information are often used interchangeably; there is a difference between them, however. For example, data can be described as raw, unorganized facts that need to be processed — a collection of numbers, symbols, or characters before it has been cleaned and corrected. Raw data needs to be corrected to remove flaws like outliers and data entry errors.
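As a rough illustration of what correcting raw data can involve, the sketch below drops entries that are not numbers and then filters outliers. The sample readings and the helper names `parse_numeric` and `remove_outliers` are hypothetical, and the 1.5 × IQR cutoff is just one common rule of thumb assumed here. Whether an outlier is really a flaw depends on the data and the use case.

```python
import statistics

# Hypothetical raw readings as they might arrive from a form or a sensor feed.
RAW_READINGS = ["21.5", "22.1", "abc", "", "21.9", "22.4", "210.0", "21.7", "22.0"]


def parse_numeric(values):
    """Keep only entries that parse as numbers; drop data entry errors."""
    parsed = []
    for value in values:
        try:
            parsed.append(float(value))
        except ValueError:
            continue  # not a number, so treat it as an entry error
    return parsed


def remove_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


cleaned = remove_outliers(parse_numeric(RAW_READINGS))
print(cleaned)
```

Here the two unparseable entries are discarded first, and the reading of 210.0 is then filtered out because it lies far outside the range of the other values.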
Raw data can be generated in many different ways. Field data, for example, is raw data that has been collected in an uncontrolled live environment. Experimental data has been generated within the context of a scientific investigation by observation and recording. Data can be simple and seemingly random and useless until it’s organized, but once data is processed, organized, structured, or presented in a given context that makes it useful, it’s called information.
Historically, the concept of data has been most closely associated with scientific research, but now data is being collected, stored, and used by an increasing number of companies, organizations, and institutions. For companies, examples of interesting data can be customer data, product data, sales data, revenue, and profits; for governments, it can include data such as crime rates and unemployment rates.
During the second half of the 1900s, there were several attempts to standardize the categorization and structure of data in order to make sense of its various forms. One well-known model for this is the DIKW (data, information, knowledge, and wisdom) pyramid, described in the following list. The first version of this model was drafted as early as the mid-1950s, but it first appeared in its current form in the mid-1990s, as an attempt to make sense of the growing amounts of data (raw or processed) that were being generated from different computer systems:
- Data is raw. It simply exists and has no significance beyond its existence (in and of itself). It can exist in any form, usable or not. Data represents a fact or statement of event without relation to other factors — it’s raining, for example.
- Information is data that has been given a meaning by way of some sort of relationship. This meaning can be useful, but does not have to be. The information relationship can be one of cause and effect: the temperature dropped 15 degrees and then it started raining, for example.
- Knowledge is the collection of information with the purpose of being useful. It represents a pattern that connects discrete elements and generally provides a high level of predictability for what is described or what will happen next. For example, if the humidity is very high and the temperature drops substantially, the atmosphere is often unlikely to be able to hold the moisture, and so it rains.
- Wisdom exemplifies more of an understanding of fundamental principles within the knowledge that essentially form the basis of the knowledge being what it is. Wisdom is essentially like a shared understanding that is not questioned: it rains because it rains, for example. This encompasses an understanding of all the interactions that happen between raining, evaporation, air currents, temperature gradients, and changes in them.
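The first three levels of the pyramid can be loosely sketched in Python using the weather example from the list above. Everything here is an illustrative assumption: the readings, the names `to_information` and `rain_expected`, and the humidity and temperature cutoffs are made up for the sketch. Wisdom is left out because it does not reduce to a simple rule.

```python
# Data: raw, unrelated observations (hypothetical hourly readings).
OBSERVATIONS = [
    {"hour": 1, "humidity": 60, "temp": 70, "raining": False},
    {"hour": 2, "humidity": 92, "temp": 68, "raining": False},
    {"hour": 3, "humidity": 94, "temp": 53, "raining": True},
    {"hour": 4, "humidity": 65, "temp": 52, "raining": False},
]


def to_information(observations):
    """Information: relate each reading to the one before it (temperature drop)."""
    related = []
    for previous, current in zip(observations, observations[1:]):
        related.append({
            "hour": current["hour"],
            "humidity": current["humidity"],
            "temp_drop": previous["temp"] - current["temp"],
            "raining": current["raining"],
        })
    return related


def rain_expected(humidity, temp_drop, humidity_min=90, drop_min=10):
    """Knowledge: a reusable pattern; the cutoffs are illustrative assumptions."""
    return humidity >= humidity_min and temp_drop >= drop_min


information = to_information(OBSERVATIONS)
predictions = [rain_expected(r["humidity"], r["temp_drop"]) for r in information]

for record, predicted in zip(information, predictions):
    print(record["hour"], "predicted:", predicted, "observed:", record["raining"])
```

The raw readings say nothing on their own, the derived temperature drop relates one reading to another, and the rule captures a pattern that can be applied to readings that have not been seen yet.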
The DIKW pyramid offered a new way to categorize data as it passes through different stages in its life cycle and has gained some attention over the years. However, it has also been criticized, and variants have appeared that were designed to improve on the original. One major criticism has been that, although it’s easy enough to understand the step from data to information, it’s much harder to draw a clear and valid line from information to knowledge and from knowledge to wisdom, making it difficult to apply in practice.
Conceptual models are heuristic devices: They’re useful only insofar as they offer a way to learn something new. One model or another may be more appealing to you, but from the perspective of a data science implementation, the most important thing for you to consider is a question like this: Will my company gain value from having the four levels of the DIKW pyramid, or will it just make implementation more difficult and complex?
Knowing the Value of Data
The statement “Data is the new oil” is one that lots of people make, but what does it mean? In some ways, the analogy does fit: It’s easy to draw parallels because of the way information (data) is used to drive much of the transformative technology available today via artificial intelligence, machine learning, automation, and advanced analytics — much like oil drives the global industrial economy.
So, as a marketing approach and a high-level description, the expression does its job, but if you take it as an indication of how to strategically address the value of data, it might lead to investments that cannot be turned into value. For example, storing data has no guaranteed future value, as storing oil does. Storing even more data has even less value because it becomes even more difficult to find the data so that you can put it to use. The value in data lies not in saving it up or storing it; it lies in putting it to use, over and over again. That’s when the value in data is realized.
If you start by looking at the core of the analogy, you can see that it refers to the value aspects of data as an enabler of a fundamental transformation of society — just like oil has proven to be throughout history. From that perspective, it definitely showcases the similarities between oil and data. Another similarity is that, although inherently valuable, data needs processing — just as oil needs refining — before its true value can be unlocked.
However, data also has many other aspects that cause the analogy to fall apart when examined more closely. To see what this means, check out some of the differences between these two enablers of transformation:
- Availability: Though oil is a finite resource, data is an endless and constantly increasing resource. This means that treating data like oil (hoarding it and storing it in siloes, for example) has little benefit and reduces its usefulness. Nevertheless, because of the misconception that data is similar to oil (scarce), this is often exactly what is done with the data, driving investments and behavior in the wrong direction.
- Reusability: Data becomes more useful the more it’s used, which is the exact opposite of what happens with oil. When oil is used to generate energy like heat or light, or when oil is permanently converted into another form such as plastic, the oil is gone and cannot be reused. Therefore, treating data like oil — using it once and then assuming that its usefulness has been exhausted and disposing of it — is definitely a mistake.
- Capture: Everyone knows that as the world’s oil reserves decline, extracting oil becomes increasingly difficult and expensive. Data, on the other hand, is becoming increasingly available as the digitalization of society increases.
- Variety: Data also has far more variety than oil. The raw oil that’s drilled from the ground is processed in a variety of ways into many different products, of course, but in its raw state, it’s all the same. Data in its raw format can represent words, pictures, sounds, ideas, facts, measurements, statistics, or any other characteristic that can be processed by computers.
The fact nevertheless remains that the quantities of data available today comprise an entirely new commodity, though the rules for capturing, storing, treating, and using data are still being written. Let’s stress here, however, that data, like oil, is a vital source of power and that the companies that utilize the available data in the most optimized way (thereby controlling the market) are establishing themselves as the leaders of the world economy, just as the oil barons did a hundred years ago.