Statistics for Big Data For Dummies

Statistical software packages are extremely powerful these days, but they cannot overcome poor quality data. Following is a checklist of things you need to do before you go off building statistical models.

Check data formats

Your analysis always starts with a raw data file. Raw data files come in many different shapes and sizes. Mainframe data is different than PC data, spreadsheet data is formatted differently than web data, and so forth. And in the age of big data, you will surely be faced with data from a variety of sources. Your first step in analyzing your data is making sure you can read the files you're given.

You need to actually look at what each field contains. For example, it's not wise to trust that just because a field is listed as a character field, it actually contains character data.

Verify data types

All data falls into one of four categories that affect what sort of statistics you can appropriately apply to it:

• Nominal data is essentially just a name or an identifier.

• Ordinal data puts records into order from lowest to highest.

• Interval data represents values where the differences between them are comparable.

• Ratio data is like interval data except that it also allows for a value of 0.

It's important to understand which categories your data falls into before you feed it into the statistical software. Otherwise, you risk ending up with perfectly reasonable-looking gibberish.

Getting a sense of how your data is distributed is important. You can run statistical procedures until you're blue in the face, but none of them will give you as much insight into what your data looks like as a simple graph.

Verify data accuracy

Once you're comfortable that the data is formatted the way you want it, you still need to make sure it's accurate and that it makes sense. This step requires that you have some knowledge of the subject area you are working in.

There isn't really a cut-and-dried approach to verifying data accuracy. The basic idea is to formulate some properties that you think the data should exhibit and test the data to see if those properties hold. Are stock prices always positive? Do all the product codes match the list of valid ones? Essentially, you're trying to figure out whether the data really is what you've been told it is.

Identify outliers

Outliers are data points that are out of whack with the rest of the data. They are either very large or very small values compared with the rest of the dataset.

Outliers are problematic because they can seriously compromise statistics and statistical procedures. A single outlier can have a huge impact on the value of the mean. Because the mean is supposed to represent the center of the data, in a sense, this one outlier renders the mean useless.

When faced with outliers, the most common strategy is to delete them. In some cases, though, you may want to take them into account. In these cases, it's usually desirable to do your analysis twice — once with outliers included and once with the outliers excluded. This allows you to evaluate which method gives more useful results.

Deal with missing values

Missing values are one of the most common (and annoying) data problems you will encounter. Your first impulse might be to drop records with missing values from your analysis. The problem with this is that missing values are frequently not just random little data glitches.

Many statistical procedures depend on the assumption that the data is distributed in a certain way. If that assumption fails to be the case, the accuracy of your predictions suffers.

The most common assumption for the modeling techniques discussed in this book is that the data is normally distributed.

Or not. In cases where the data isn't distributed as you need it to be, all is not necessarily lost. There are a variety of ways of transforming data to get the distribution into the shape you need it.

One of the best ways to verify the accuracy of a statistical model is to actually test it against the data once it's built. One way to do that is to randomly split your dataset into two files. You might call these files Analysis and Test, respectively.

You need to split the data randomly to be effective. You cannot simply split the dataset into the top half and the bottom half, for example. Almost all data files are sorted somehow — by date if nothing else. This introduces systematic patterns that will give different portions of the file different statistical properties. When you split the file randomly, you give each record an equal chance of being in either file. Figuratively, you're flipping a coin for each record to decide which file it goes into. Randomness gives both files the same statistical properties as the original data.

Once you have split the dataset, set aside the Test file. Then proceed to build your predictive model using the Analysis file. Once the model is built, apply it to the Test file and see how it does.

Testing models in this way helps safeguard against a phenomenon known as over-fitting. Essentially, it's possible for statistical procedures to memorize the data file rather than discover meaningful relationships among the variables. If over-fitting occurs, the model will test quite poorly against the Test file.

Back up and document everything you do

Because statistical software is getting to be so simple to use, it's a piece of cake to start generating reports and graphs, not to mention data files. You can run procedures literally at the touch of a button. You can generate several dozen graphs based on different data transformations in a matter of a few minutes. That makes it pretty easy to lose track of what you have done, and why.

It's important to make sure you keep a written record of what you're up to. Graphs should be labeled with the name (and version) of the data that was used to create them. Statistical procedures that you build need to be saved and documented.

It's also important to back up your data files. In the course of your analysis, you will likely create several versions of your data that reflect various corrections and transformation of variables. You should save the procedures that created these versions. They should also be documented in a way that describes what transformations you have made and why.

Documentation isn't anyone's favorite task, but we speak from experience when we strongly encourage you not to rely on your memory when it comes to your analysis projects.

By working through the steps just described, you maximize the reliability of your statistical models. In many cases, the prep work is actually more time-consuming than the actual model building. But it's necessary. And you'll thank yourself in the end for working through it methodically.