Data Science: Contextualizing Problems and Data in Python
Putting your problem in the correct context is an essential part of developing a data science solution with Python for any given problem and its associated data. Data science is definitely an applied science, and abstract textbook approaches may not work well in your specific situation.
Running a Hadoop cluster or building a deep neural network may sound cool in front of colleagues and make you feel that you are doing great data science, but neither may provide what you need to solve your problem.
Putting the problem in the correct context isn’t just a matter of deciding whether to use a certain algorithm or how to transform the data. It’s the art of critically examining the problem and the available resources, and then creating an environment in which you can solve the problem and obtain the desired solution.
The key point here is the desired solution: you could come up with solutions that aren’t desirable because they don’t tell you what you need to know or, even when they do, they waste too much time and too many resources.
Evaluating a data science problem
When working through a data science problem, you need to start by considering your goal and the resources you have available for achieving that goal. The resources are your data and your computational resources, such as available memory, CPUs, and disk space.
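As a rough illustration of taking stock of computational resources, the sketch below uses only the Python standard library to report the CPU count and disk space. The function name describe_resources is hypothetical. The standard library has no portable call for available memory, so that part is left out here; a third-party package such as psutil is one option for it.

```python
import os
import shutil


def describe_resources(path="."):
    """Return a small summary of local compute resources.

    `path` is any directory on the disk you plan to store data on.
    """
    usage = shutil.disk_usage(path)
    return {
        # os.cpu_count() can return None when the count is undetermined.
        "cpu_count": os.cpu_count(),
        "disk_total_gb": usage.total / 1e9,
        "disk_free_gb": usage.free / 1e9,
    }


resources = describe_resources()
for name, value in resources.items():
    print(f"{name}: {value}")
```

Comparing these numbers with the size of your dataset is a quick way to judge whether the analysis can run on one machine before you reach for heavier infrastructure.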
Most of the time, you have to face completely new problems, and you have to build your solution from scratch. During your first evaluation of a data science problem, you need to consider the following:
The data available in terms of accessibility, quantity, and quality. You must also consider the data in terms of possible biases that could influence or even distort its characteristics and content. Data never contains absolute truths, only relative truths that offer you a more or less useful view of a problem. Always be aware of the truthfulness of data and apply critical reasoning as part of your analysis of it.
The methods you can feasibly use to analyze the dataset. Consider whether the methods are simple or complex. You must also decide how well you know a particular methodology. Start by using simple approaches, and never fall in love with any particular technique. There are neither free lunches nor Holy Grails in data science.
The questions you want to answer by performing your analysis and how you can quantitatively measure whether you achieved a satisfactory answer to them. “If you cannot measure it, you cannot improve it,” is a maxim often attributed to Lord Kelvin. If you can measure performance, you can determine the impact of your work and even make a monetary estimation. Stakeholders will be delighted to know that you’ve figured out what to do.
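To make the last two points concrete, here is a minimal sketch of measuring a candidate solution against a very simple baseline. All numbers are made up for illustration, and mean absolute error is just one possible metric; the right measure depends on the question you are trying to answer.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical sales from earlier weeks, used only to build the baseline.
history = [110, 125, 130, 115]
# Hypothetical sales for the weeks being predicted.
actual = [120, 135, 128, 150, 142]
# Hypothetical predictions from some candidate model.
model_predictions = [118, 130, 131, 147, 145]

# Simplest possible approach: always predict the historical average.
baseline_predictions = [mean(history)] * len(actual)

model_mae = mean_absolute_error(actual, model_predictions)
baseline_mae = mean_absolute_error(actual, baseline_predictions)

# A positive value means the model beats the simple baseline.
improvement = baseline_mae - model_mae
print(f"model MAE={model_mae}, baseline MAE={baseline_mae}")
```

If a more complex method cannot clearly beat a baseline like this one, the extra complexity is hard to justify.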
Data science is a complex system of knowledge at the intersection of computer science, math, statistics, and business. If someone has already faced the same problem or dilemmas as you face, reinventing the wheel makes little sense. Now that you have contextualized your project, you know what you’re looking for and you can search for it in different ways.
Check the Python documentation. You might be able to find examples that suggest a possible solution. NumPy, SciPy, pandas, and especially scikit-learn have detailed in-line and online documentation with plenty of data science-related examples.
Seek out online articles and blogs that hint at how other practitioners solved similar problems. Q&A websites such as Quora, Stack Overflow, and Cross Validated can provide you with plenty of answers to similar problems.
Consult academic papers. For example, you can query your problem on Google Scholar or Microsoft Academic Search. You can find a series of scientific papers that can tell you about preparing the data or detail the kinds of algorithms that work better for a particular problem.
It may seem trivial, but the solutions you create have to reflect the problem you’re trying to solve. As you research solutions, you may find that some of them seem promising at first, but then you can’t successfully apply them to your case because something in their context is different.
For instance, your dataset may be incomplete or may not provide enough input to solve the problem. In addition, the analysis model you select may not actually provide the answer you need or the answer might prove inaccurate. As you work through the problem, don’t be afraid to perform your research multiple times as you discover, test, and evaluate possible solutions that you could apply given the resources available and your actual constraints.
Formulating a hypothesis
At some point, you have everything you think you need to solve the problem. Of course, it’s a mistake to assume now that the solutions you create can actually solve the problem. You have a hypothesis, rather than a solution, because you have to demonstrate the efficacy of the potential solution in a scientific way. In order to form and test a hypothesis, you must train a model using a training dataset and then test it using an entirely different dataset.
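The train-then-test idea can be sketched with the standard library alone. The example assumes synthetic data and a hypothetical hypothesis: "a straight-line model predicts y better than simply guessing the training mean." The helper names (split_rows, fit_line, mean_squared_error) are illustrative; in practice you would more likely use a library such as scikit-learn, which provides its own splitting utilities.

```python
import random
from statistics import mean


def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle rows and split them into disjoint training and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar = mean(x for x, _ in rows)
    y_bar = mean(y for _, y in rows)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    variance = sum((x - x_bar) ** 2 for x, _ in rows)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def mean_squared_error(rows, predict):
    """Average squared error of predict(x) against the true y values."""
    return mean((y - predict(x)) ** 2 for x, y in rows)


# Synthetic data (an assumption for illustration): y = 3x + 2 plus noise.
noise = random.Random(42)
data = [(x, 3 * x + 2 + noise.gauss(0, 1)) for x in range(40)]

train, test = split_rows(data)

# Fit on the training set only.
slope, intercept = fit_line(train)
train_mean = mean(y for _, y in train)

# Evaluate on data the model has never seen.
model_mse = mean_squared_error(test, lambda x: slope * x + intercept)
baseline_mse = mean_squared_error(test, lambda x: train_mean)

hypothesis_supported = model_mse < baseline_mse
print(f"model MSE={model_mse:.2f}, baseline MSE={baseline_mse:.2f}")
```

Because the test rows play no part in fitting, a good score on them is evidence for the hypothesis rather than a sign that the model memorized its training data.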
Preparing your data
After you have some idea of the problem and its solution, you know the inputs required to make the algorithm work. Unfortunately, your data probably appears in multiple forms, you get it from multiple sources, and some data is missing entirely. Moreover, whoever designed the features in your existing data sources may have devised them for purposes different from yours, so you have to transform them before your algorithm can work to its full potential.
To make the algorithm work, you must prepare the data. This means checking for missing data, creating new features as needed, and possibly manipulating the dataset to get it into a form that your algorithm can actually use to make a prediction.
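The three preparation steps named above (checking for missing data, creating a feature, and reshaping the dataset) might look like the following sketch. The records, field names, and the choice of median imputation are all assumptions made for illustration; pandas offers richer tools for the same tasks.

```python
from statistics import median

# Hypothetical records merged from two sources; None marks a missing value.
raw = [
    {"height_cm": 180.0, "weight_kg": 81.0},
    {"height_cm": 165.0, "weight_kg": None},
    {"height_cm": None, "weight_kg": 70.0},
    {"height_cm": 172.0, "weight_kg": 65.0},
]


def count_missing(records):
    """Count missing (None) values per field."""
    return {
        field: sum(r[field] is None for r in records)
        for field in records[0]
    }


def impute_median(records):
    """Return new records with None replaced by that field's median.

    Median imputation is only one option; dropping rows or using a
    model-based fill may suit other problems better.
    """
    medians = {
        field: median(r[field] for r in records if r[field] is not None)
        for field in records[0]
    }
    return [
        {f: (medians[f] if v is None else v) for f, v in r.items()}
        for r in records
    ]


def add_bmi(records):
    """Add a derived feature: body mass index = kg / m^2."""
    return [
        {**r, "bmi": r["weight_kg"] / (r["height_cm"] / 100) ** 2}
        for r in records
    ]


missing = count_missing(raw)
clean = add_bmi(impute_median(raw))

# Many algorithms expect a purely numeric table: one row per record.
X = [[r["height_cm"], r["weight_kg"], r["bmi"]] for r in clean]
print(missing)
```

Each step returns new records instead of modifying raw in place, which keeps the original data available for checking how much the preparation changed it.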