Important Statistical Formulas for Big Data

By Mico Yuk, Stephanie Diamond

The word statistics may evoke fear in some beginners to data visualization, but if you ignore this topic, you overlook one of the most powerful ways to derive true insight and value from Big Data.

Statistics is the practice or science of collecting and analyzing numerical data in large quantities. You don’t have to go out and become a data scientist (a term used for statisticians who are also data geeks in disguise and who usually hold some type of advanced degree, such as a PhD), but you may want to consider picking up a Statistics 101 book or taking a class if you have any interest.

Statistical formulas such as probability, variance, and forecast are popular today. They’re fairly easy to apply to any data set, and most readers will clearly understand them. You can incorporate some of these statistical formulas into your Big Data visualizations to provide true value to users by using the techniques discussed in the following sections.

Knowing the probability that an event will occur

One statistical formula that you may be familiar with is probability: the likelihood or chance that an event will occur. The following formula calculates basic probability for a simple scenario in which every outcome is equally likely. (More complex scenarios are too much of an undertaking for a newbie.)

Probability = Number of Ways an Event Can Occur / Number of Possible Outcomes
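
As a rough illustration, basic probability for equally likely outcomes can be computed by dividing the number of ways an event can occur by the number of possible outcomes. The sketch below assumes a fair six-sided die as the example; the function name basic_probability is hypothetical and not part of any library.

```python
from fractions import Fraction


def basic_probability(favorable_outcomes: int, possible_outcomes: int) -> Fraction:
    """Return favorable / possible, assuming every outcome is equally likely."""
    if possible_outcomes <= 0:
        raise ValueError("possible_outcomes must be a positive integer")
    if not 0 <= favorable_outcomes <= possible_outcomes:
        raise ValueError("favorable_outcomes must be between 0 and possible_outcomes")
    return Fraction(favorable_outcomes, possible_outcomes)


# Assumed example: the chance of rolling a 4 on a fair six-sided die.
chance = basic_probability(1, 6)
print(f"{chance} is about {float(chance):.1%}")
```

Using Fraction keeps the result exact until it is formatted as a percentage for display.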

The following figure shows a probability with some alert colors added to make the message easy to read and, most important, to clearly indicate that immediate action is needed.


Probabilities provide a quick reality check and set the overall tone for the story the data visualization will be providing during a given period (day, week, quarter, and so on).

Applying variance to show the magnitude of change

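Another popular measure is variance, which here means the difference between two values of a data point, such as a desired target and the current actual. (In formal statistics, variance has a different meaning: the average squared deviation from the mean.)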

The most commonly used formula for calculating variance is

Variance = Final Desired State – Current State

Whether the output displayed is a whole number or percentage, the formula shows the magnitude of change between the beginning and ending state of a data point.
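
A minimal sketch of this calculation follows. The function names and the sample figures are hypothetical, and expressing the percentage relative to the desired state is an assumption; some reports use the current state as the base instead.

```python
def variance_to_target(final_desired: float, current_state: float) -> float:
    """Difference between the desired state and the current state."""
    return final_desired - current_state


def variance_percent(final_desired: float, current_state: float) -> float:
    """The same difference as a percentage of the desired state (assumed base)."""
    if final_desired == 0:
        raise ValueError("final_desired must be nonzero to compute a percentage")
    return (final_desired - current_state) / final_desired * 100


# Assumed example: a monthly sales target of 1,200 against 900 sold so far.
target, actual = 1200, 900
print(f"Variance: {variance_to_target(target, actual):,.0f}")
print(f"Variance: {variance_percent(target, actual):.1f}%")
```

A positive result means the current state is still short of the target; a negative result means the target has been exceeded.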

Displaying the variance is always a quick win and a great substitute for the line/bar chart combo, which is how the variance relationship is displayed in most visualizations.

The chart in the figure below shows a line/bar chart combo that lets the user decipher the variance for each month.


The second chart, shown in the following figure, clearly plots the variance and takes all the guesswork out of the visual.


Forecasting the future

Yet another popular statistical formula that you may be familiar with is the forecast, which is the act of predicting or estimating an event or trend.

When you calculate a forecast, you’re really using a certain amount of historic data to predict behavior, a specific event, or a trend. For example, you could calculate the sales for the year based on the historic fact that January usually accounts for 5% of the sales. If you made $500 in sales in January, you would use the following formula to forecast how much in sales you can anticipate for the year:

$500 / .05 = $10,000

In this equation, $500 is the sales in January; .05 is the historic percentage of sales that January accounts for; and $10,000 is the projected sales for the year.
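
The same arithmetic can be sketched in code. The function name forecast_annual_sales is hypothetical, and the approach assumes that the historic share for the period holds for the current year.

```python
def forecast_annual_sales(period_sales: float, historic_share: float) -> float:
    """Project annual sales from one period's sales and its historic share of the year."""
    if not 0 < historic_share <= 1:
        raise ValueError("historic_share must be greater than 0 and at most 1")
    return period_sales / historic_share


# Figures from the example above: $500 in January, which historically is 5% of the year.
projected = forecast_annual_sales(500, 0.05)
print(f"Projected annual sales: ${projected:,.2f}")
```

The guard on historic_share avoids dividing by zero and rejects shares above 100%, which would not make sense for a single period.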

The figure below shows how most data visualizations display a forecast: as a simple line in a chart. Forecasts indicate how a given activity may perform in the future.

This typical display of a forecast line shows that cash flow will eventually become an issue for this organization.