No matter how you use data, this cheat sheet will help you use it more effectively.
Common Forms of Errant Data
Data becomes less useful or possibly not useful at all when it fails to meet specific needs, such as correctness. For many people, error equates to wrong. However, in many cases, data is correct, yet also erroneous. A sales statistic may reflect reality for a particular group, but if that group isn’t part of your analysis, the data is incorrect for your needs despite being correct data. You can consider data errant when it meets any of these criteria:
- Incorrect: The data is actually wrong in some way.
- Missing: The data isn’t there to use, such as a field that someone didn’t fill out in a form.
- Wrong type: The data appears in a form that won’t work for your needs, such as numeric data that appears as a string rather than a number.
- Malformatted: The data is in the correct form but isn’t formatted correctly, such as when you receive an older abbreviation of a state name (Wis) rather than the necessary two-character form (WI). Often, this errant data occurs because of a misunderstanding or the use of an outdated standard.
- Wrong format for the task: The data is correct in every possible way except that it isn’t in the format the task requires. For example, a date could appear in MM/DD/YY form when you need it in DD/MM/YY form. Because some dates, such as January 1, are valid in both formats, this particular form of errant data is hard to track down.
- Incomplete: The data is correct to an extent, but something is missing. For example, you might need a four-digit year, but you receive only a two-digit year instead.
- Imprecise: The data isn’t sufficiently accurate for your task, such as when you receive an integer value in place of a floating-point value.
- Misaligned: The data is correct, is in the right form, and is even of the right precision, but it still won’t parse because of some type of shifting. For example, when working with text, perhaps two spaces appear after each period, rather than one space. A number field on a form might not convert to a number because someone added a space in the entry.
- Outdated: Data gets old just like anything else. Using old data will cause problems with your analysis unless you’re looking at it for historical purposes.
- Opinion, rather than fact: A fact is verifiable through some type of process and vetted through peer review. An opinion can inform or enlighten, but it won’t help your analysis.
- Misclassified: The data is useful in every possible way, except that it’s the wrong information. For example, you might find yourself using statistics on cat physiology when your intention was to research dogs.
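A few of the categories above can be checked mechanically. The following is a minimal sketch using only the Python standard library. The field names (state, quantity, order_date), the short list of valid state codes, and the helper names are assumptions made for illustration, not part of any particular dataset or library.

```python
from datetime import datetime

# Hypothetical subset of two-character state codes, for illustration only.
VALID_STATES = {"IL", "MN", "WI"}


def is_ambiguous_date(text):
    """True when text parses as both MM/DD/YY and DD/MM/YY with different results."""
    parsed = []
    for fmt in ("%m/%d/%y", "%d/%m/%y"):
        try:
            parsed.append(datetime.strptime(text, fmt))
        except ValueError:
            parsed.append(None)
    return None not in parsed and parsed[0] != parsed[1]


def find_problems(record):
    """Return (field, problem) pairs for one hypothetical form record."""
    problems = []
    state = record.get("state")
    quantity = record.get("quantity")
    order_date = record.get("order_date")

    # Missing versus malformatted state value.
    if not state:
        problems.append(("state", "missing"))
    elif state not in VALID_STATES:
        problems.append(("state", "malformatted"))

    # Numeric data that arrives as a string, possibly with stray spaces.
    if quantity is None or quantity == "":
        problems.append(("quantity", "missing"))
    elif isinstance(quantity, str):
        if quantity != quantity.strip():
            problems.append(("quantity", "misaligned"))
        else:
            problems.append(("quantity", "wrong type"))

    # A date that is valid in both formats cannot be trusted without context.
    if order_date and is_ambiguous_date(order_date):
        problems.append(("order_date", "ambiguous format"))
    return problems


record = {"state": "Wis", "quantity": " 12", "order_date": "01/02/21"}
print(find_problems(record))
```

Categories such as outdated, opinion-based, or misclassified data generally need human judgment and are not covered by a check like this.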
Missing data will tend to skew or bias the results of any analysis you perform using it. Consequently, you must find a way to deal with the missing data or face the fact that your analysis will contain flaws. You have a few possible strategies to handle missing data effectively. Your strategy may change if you have to handle missing values of these types:
- Quantitative values: Data values expressed as numbers.
- Qualitative features: Data that refers to concepts. Even though you express them as numbers, their values are somewhat arbitrary, and you cannot meaningfully compute an average or perform other calculations on them.
When working with qualitative features, your value guessing should always produce integer numbers, based on the numbers used as codes. Here are common strategies for missing data handling:
- Replace missing values with a computed constant such as the mean or the median value. If your feature is a category, you must provide a specific value because the numbering is arbitrary, and using mean or median doesn’t make sense. Use this strategy when the missing values are random.
- Replace missing values with a value outside the normal value range of the feature. For instance, if the feature is positive, replace missing values with negative values. This approach works fine with decision tree–based algorithms and qualitative variables.
- Replace missing values with 0, which works well with regression models and standardized variables. This approach is also applicable to qualitative variables when they contain binary values.
- Interpolate the missing values when they are part of a series of values tied to time. This approach works only for quantitative values. For instance, if your feature is daily sales, you could use a moving average of the last seven days or pick the value at the same time as the previous week.
- Impute their value using the information from other predictor features (but never use the response variable). Particularly in R, specialized libraries like missForest, MICE, and Amelia II can do everything for you. Scikit-learn provides an experimental missing-values imputer (IterativeImputer) that allows imputing data in Python using Multivariate Imputation by Chained Equations (MICE), missForest, or even Amelia methodologies.
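Three of the simpler strategies above can be sketched with the standard library alone: median replacement for a quantitative feature, most-frequent-code replacement for a qualitative feature, and a moving average for a time-ordered series. The function names and sample values are hypothetical, and None stands in for a missing value.

```python
from statistics import median, mode


def fill_quantitative(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def fill_qualitative(codes):
    """Replace None with the most frequent category code, which stays an integer."""
    observed = [c for c in codes if c is not None]
    fill = mode(observed)
    return [fill if c is None else c for c in codes]


def fill_series(daily, window=7):
    """Replace None with the mean of up to `window` preceding values.

    Values filled earlier in the series count toward later averages.
    A gap with nothing before it is left as None.
    """
    filled = []
    for value in daily:
        if value is None:
            recent = [x for x in filled[-window:] if x is not None]
            value = sum(recent) / len(recent) if recent else None
        filled.append(value)
    return filled


ages = fill_quantitative([31, None, 45, 27, None])
colors = fill_qualitative([1, 2, None, 2, 3])
sales = fill_series([10.0, 12.0, None, 14.0])
print(ages, colors, sales)
```

Using the most frequent code for the qualitative feature is one way to provide a specific value. Any other explicitly chosen code would fit the same pattern.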
Another good practice is to create a new binary feature for each variable whose values you repaired. The binary variable flags each replaced or imputed value with a positive value (1), so your machine learning algorithm can figure out when it must make additional adjustments to the values you actually used.
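As a minimal sketch of that practice, the hypothetical helper below fills missing values with the median and returns a parallel list of 0/1 flags marking which entries were repaired. The name impute_with_flag and the sample data are assumptions for illustration.

```python
from statistics import median


def impute_with_flag(values):
    """Return (filled_values, flags), where flags holds 1 for each repaired entry."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return filled, flags


income, income_was_missing = impute_with_flag([52.0, None, 48.0, 61.0])
print(income, income_was_missing)
```

Both lists would then be passed to the learning algorithm as separate features.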
Consider Novelty Data Types
Experience teaches that the world is rarely stable. Sometimes novelties do naturally appear because the world is so mutable. Consequently, your data changes over time in unexpected ways, in both target and predictor variables. This phenomenon is called concept drift. A target variable is the variable you want to know more about, and a predictor variable is an independent variable used to predict the target variable. The term concept refers to your target, and drift refers to the source data used to perform a prediction, which moves in a slow but uncontrollable way, like a boat drifting because of strong tides.
To obtain a relevant and useful analysis, you check for any new data containing anomalies with respect to existing cases. Maybe you spent a lot of time cleaning your data or you developed a machine learning application based on available data, so it would be critical to figure out whether the new data is similar to the old data and whether the algorithms will continue working well in classification or prediction.
In such cases, data scientists talk of novelty detection, because they need to know how well the new data resembles the old. Being exceptionally new is considered an anomaly: Novelty may conceal a significant event or may risk preventing an algorithm from working properly because tasks such as machine learning rely heavily on learning from past examples, and the algorithm may not generalize to completely novel cases. When working with new data, you should retrain the algorithm. When considering a data science model, you distinguish between different concept drift and novelty situations using these criteria:
- Physical: Face or voice recognition systems, or even climate models, never really change. Don’t expect novelties, but check for outliers that result from data problems, such as erroneous measurements.
- Political and economic: These models sometimes change, especially in the long run. You have to keep an eye out for long-term effects that start slowly and then propagate and consolidate, rendering your models ineffective.
- Social behavior: Social networks and the language you use every day change over time. Expect novelties to appear and take precautionary steps; otherwise, your model will suddenly deteriorate and turn unusable.
- Search engine data, banking, and e-commerce fraud schemes: These models change quite often. You need to exercise extra care in checking for the appearance of novelties, which tell you it’s time to train a new model to maintain accuracy.
- Cyber security threats and advertising trends: These models change continuously. Spotting novelties is the norm, and reusing the same models over a long time is a hazard.
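One simple way to check how well new data resembles old data is to flag new values that sit far from the old data's mean, measured in standard deviations. This is only a sketch of one basic technique under stated assumptions: a single numeric feature, old data that actually varies, and a threshold (3.0) and retraining limit (10 percent) that are hypothetical values you would tune for your own domain.

```python
from statistics import mean, stdev


def find_novelties(old_values, new_values, threshold=3.0):
    """Return new values more than `threshold` standard deviations from the old mean.

    Assumes old_values has at least two entries and is not constant.
    """
    center = mean(old_values)
    spread = stdev(old_values)
    return [v for v in new_values if abs(v - center) / spread > threshold]


def should_retrain(old_values, new_values, max_rate=0.1):
    """Suggest retraining when the share of novel values exceeds max_rate."""
    novel = find_novelties(old_values, new_values)
    return len(novel) / len(new_values) > max_rate


old = [10, 11, 9, 10, 12, 10, 9, 11]
new = [10, 11, 25, 9]
print(find_novelties(old, new), should_retrain(old, new))
```

For fast-changing domains such as fraud schemes or security threats, a check like this would run on every new batch of data. For physical models, it serves mainly to catch measurement errors.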
Choose the Correct Programming Language
Data scientists usually use only a few languages because they make working with data easier. The world holds many different programming languages, and most are designed to perform tasks in a certain way or even make a particular profession’s work easier to do. Choosing the correct tool makes your life easier. Using the wrong tool is akin to using a hammer rather than a screwdriver to drive a screw. Yes, the hammer works, but the screwdriver is much easier to use and definitely does a better job. Here are the top languages for data science work in order of preference:
- Python (general purpose): Many data scientists prefer to use Python because it provides a wealth of libraries, such as NumPy, SciPy, Matplotlib, pandas, and scikit-learn, to make data science tasks significantly easier. Python is also a precise language that makes using multiprocessing on large datasets easy, reducing the time required to analyze them. The data science community has also stepped up with specialized distributions, such as Anaconda, that bundle Jupyter Notebook, which makes working with data science calculations significantly easier. Besides all of these things in Python’s favor, it’s also an excellent language for creating glue code with languages such as C/C++ and Fortran. The Python documentation shows how to create the required extensions. Many Python users rely on the language to find patterns, such as allowing a robot to see a group of pixels as an object. Python also sees use for all sorts of scientific tasks.
- R (special purpose statistical): In many respects, Python and R share the same sorts of functionality, but they implement it in different ways. Depending on which source you view, Python and R have about the same number of proponents, and some people use Python and R interchangeably (or sometimes in tandem). Unlike Python, R provides its own environment, so you don’t need a third-party product such as Anaconda. However, you can also use third-party IDEs such as Jupyter Notebook so that you can use a single IDE for all your needs. Unfortunately, R doesn’t appear to mix with other languages with the ease that Python provides.
- SQL (database management): The most important thing to remember about Structured Query Language (SQL) is that it focuses on data rather than tasks. Businesses can’t operate without good data management — the data is the business. Large organizations use some sort of relational database, which is normally accessible with SQL, to store their data. Most Database Management System (DBMS) products rely on SQL as their main language, and the DBMS usually has a large number of data analysis and other data science features built in. Because you’re accessing the data natively, you often experience a significant speed gain in performing data science tasks this way. Database Administrators (DBAs) generally use SQL to manage or manipulate the data rather than necessarily perform detailed analysis of it. However, the data scientist can also use SQL for various data science tasks and make the resulting scripts available to the DBAs for their needs.
- Java (general purpose): Some data scientists perform other kinds of programming that require a general-purpose, widely adopted, and popular language. In addition to providing access to a large number of libraries (most of which aren’t actually all that useful for data science but do work for other needs), Java supports object orientation better than any of the other languages in this list. In addition, it’s strongly typed and tends to run quite quickly. Consequently, some people prefer it for finalized code. Java isn’t a good choice for experimentation or ad hoc queries. Oddly enough, an implementation of Java for Jupyter Notebook exists, but it isn’t refined and is not usable for data science work at this time.
One thing to note about Java is that Microsoft is taking a significantly stronger interest in the language, and that may spell some changes in the future. For some ideas on changes that could take place, see these articles: “Microsoft sails past Oracle in bringing Java SE to the cloud”; “Java on Visual Studio Code October Update”; “JAX London 2019 has begun: ‘Microsoft is now a Java shop’”; and “Episode 48. On Jakarta EE 9 Band-aids, OracleCodeOne Debrief, Unionizing Tech, IBM vs Microsoft and Oracle JDBC Drivers!”
- Scala (general purpose): Because Scala uses the Java Virtual Machine (JVM), it has some of the advantages and disadvantages of Java. However, like Python, Scala provides strong support for the functional programming paradigm, which uses lambda calculus as its basis (see Functional Programming For Dummies, by John Mueller [Wiley] for details). In addition, Apache Spark is written in Scala, which means that you have good support for cluster computing when using this language (think huge dataset support). Some of the pitfalls of using Scala are that it’s hard to set up correctly, it has a steep learning curve, and it lacks a comprehensive set of data science specific libraries.
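The SQL item above notes that a DBMS can perform analysis on the data where it lives. As a minimal sketch of that idea, the following uses Python’s built-in sqlite3 module with an in-memory database so that the database engine, rather than Python code, computes a per-group average. The sales table, its columns, and the function name are hypothetical.

```python
import sqlite3


def average_by_region(rows):
    """Load (region, amount) rows into SQLite and let SQL compute the summary."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        # The aggregation runs inside the database engine.
        query = (
            "SELECT region, AVG(amount), COUNT(*) "
            "FROM sales GROUP BY region ORDER BY region"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


summary = average_by_region(
    [("east", 120.0), ("west", 80.0), ("east", 60.0), ("west", 100.0), ("west", 30.0)]
)
print(summary)
```

With a production DBMS, the same query would typically run on the server, so only the small summary travels to the analysis environment.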