EDA Techniques for Testing Assumptions

Alan Anderson

David Semmelroth

Updated: 2016-03-26 07:28:33

From the book

Statistics for Big Data For Dummies

You can use several exploratory data analysis (EDA) techniques to test assumptions about a dataset. These include the run sequence plot, the lag plot, the histogram, and the normal probability plot.

Run sequence plot

Many statistical techniques are based on the assumption that the data being analyzed has the following properties:

The observations are independent of each other
The observations are drawn from a common probability distribution
The distribution's parameters (for example, the mean and standard deviation) are the same for all observations

A run sequence plot, which shows the observations in the order in which they were recorded, helps you check whether the data conforms to these assumptions. For example, the following figure shows a run sequence plot for the daily returns to the Standard and Poor's 500 (S&P 500) stock market index.

Run sequence plot of daily returns to the S&P 500.

Because this is a time series plot, it's being used to determine whether the returns to the S&P 500 are independent of each other, whether they are all drawn from the same probability distribution, and whether the parameters (mean and variance) remain constant over time.

The run sequence plot is designed to answer these questions:

Are there any changes in the mean of the data?
Are there any changes in the variance of the data?

In addition, you use the run sequence plot to identify any outliers in the data.

The plot of the returns to the S&P 500 shows that the mean and variance of the data remain stable over time, and that there do not appear to be any outliers.
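The questions that the run sequence plot is meant to answer can also be checked numerically. The following is a minimal sketch, not the authors' method: it splits a series into consecutive windows and reports the mean and variance of each window, so a shift in either one shows up as a change from one window to the next. The helper name window_stats, the window length of 50, and the synthetic returns are assumptions made for illustration; they are not the S&P 500 data shown in the figure.

```python
import random
import statistics


def window_stats(values, window):
    """Return (start_index, mean, sample_variance) for each consecutive full window."""
    stats = []
    for start in range(0, len(values) - window + 1, window):
        chunk = values[start:start + window]
        stats.append((start, statistics.fmean(chunk), statistics.variance(chunk)))
    return stats


# Synthetic stand-in for daily returns (an assumption, NOT real S&P 500 data).
rng = random.Random(0)
returns = [rng.gauss(0.0005, 0.01) for _ in range(250)]

window = 50  # hypothetical window length; choose one that suits your data
for start, mean, var in window_stats(returns, window):
    end = start + window - 1
    print(f"obs {start:3d}-{end:3d}: mean={mean:+.5f}  variance={var:.6f}")
```

If the means or variances drift steadily or jump between windows, the assumption of common parameters is in doubt. Plotting the raw series in order remains the quickest way to spot outliers.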

Lag plot

A lag plot helps you determine whether the elements of a dataset are random (independent of each other). In other words, the plot shows whether there's a pattern in the data. Patterns in the data are inconsistent with randomness.

A lagged value is one that has occurred in the past. A lag of 1 refers to an observation that has taken place one period in the past. A lag of 2 refers to an observation that has taken place two periods in the past, and so forth.

A lag plot shows the values of a variable on the vertical axis, and the lagged values of the same variable on the horizontal axis. For example, this figure shows a lag plot for the daily returns to the Standard and Poor's stock market index.

Lag plot of daily returns to the Standard and Poor's 500 in 2013.

The points on this plot are randomly scattered with no particular pattern. This is consistent with the assumption of randomness in the data.
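To make the construction of a lag plot concrete, here is a minimal sketch that builds the (lagged value, current value) pairs and summarizes them with a correlation coefficient. The function names lag_pairs and lag_correlation and the synthetic series are assumptions for illustration. The correlation only detects linear patterns, so it complements the plot rather than replacing it.

```python
import math
import random
import statistics


def lag_pairs(values, lag=1):
    """Pair each lagged value (horizontal axis) with the current value (vertical axis)."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    return list(zip(values[:-lag], values[lag:]))


def lag_correlation(values, lag=1):
    """Pearson correlation between the series and its lagged copy."""
    pairs = lag_pairs(values, lag)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Synthetic stand-in for daily returns (an assumption, NOT real S&P 500 data).
rng = random.Random(1)
returns = [rng.gauss(0.0, 0.01) for _ in range(250)]

print("first three lag-1 pairs:", lag_pairs(returns)[:3])
print("lag-1 correlation:", round(lag_correlation(returns), 3))
```

A correlation near 0 is consistent with randomness, while a value near 1 or -1 indicates that each observation depends strongly on the previous one.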

Histogram

You can use a histogram to identify the distribution followed by a dataset. A histogram can show several key details about a dataset, including the following:

The center of the data
The spread (variability) of the data
The skewness of the data (if any)
The presence of outliers

For example, this figure shows a histogram for the daily returns to the Standard and Poor's stock market index.

Histogram of daily returns to the S&P 500.

The graph shows that the Standard and Poor's returns have a mean of approximately 0, because the tallest bars are clustered near 0. The returns appear to exhibit negative skewness; that is, extreme negative returns are more common than extreme positive returns and have a greater magnitude. There do not appear to be any outliers in the data.
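The details that a histogram reveals can be sketched in code as well. The example below is an illustration under stated assumptions, not the authors' procedure: it counts observations in equal-width bins, prints a text histogram, and computes a moment-based skewness coefficient (negative values indicate a longer left tail). The names histogram_counts and sample_skewness, the choice of 10 bins, and the synthetic data are all hypothetical.

```python
import random
import statistics


def histogram_counts(values, bins=10):
    """Return (edges, counts) for `bins` equal-width intervals from min to max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


def sample_skewness(values):
    """Third standardized moment; 0 for symmetric data, negative for a left tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


# Synthetic stand-in for daily returns (an assumption, NOT real S&P 500 data).
rng = random.Random(2)
returns = [rng.gauss(0.0, 0.01) for _ in range(250)]

edges, counts = histogram_counts(returns, bins=10)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:+.4f} to {right:+.4f} | {'#' * count}")
print("skewness:", round(sample_skewness(returns), 3))
```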

Normal probability plot

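Use a normal probability plot to compare a dataset to the normal distribution. The vertical axis of this plot shows the quantiles of the dataset, and the horizontal axis shows the quantiles of the normal distribution. If a dataset is normally distributed, then the graph should appear to be a straight line. The slope is 1 when the dataset is compared with a normal distribution that has the same mean and standard deviation (for example, when the data have been standardized and are compared with the standard normal distribution).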

Quantiles are used to divide a dataset into equally sized groups. A widely used type of quantile is the quartile, which (as discussed earlier) divides a dataset into four equal groups, each consisting of 25 percent of the data. Another popular choice is the percentile, which divides a dataset into one hundred equal groups, each consisting of 1 percent of the data.
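As a small illustration of quantiles, the sketch below uses the Python standard library function statistics.quantiles to compute the cut points for quartiles and percentiles, and then counts how many observations fall into each quartile group. The helper names cut_points and group_sizes and the synthetic data are assumptions for illustration. Several conventions exist for interpolating quantiles, so other tools may give slightly different cut points.

```python
import bisect
import random
import statistics


def cut_points(values, groups):
    """Return the groups - 1 cut points that split the data into `groups` parts."""
    return statistics.quantiles(values, n=groups)


def group_sizes(values, cuts):
    """Count the observations in each group defined by sorted cut points."""
    sizes = [0] * (len(cuts) + 1)
    for v in values:
        # A value equal to a cut point is counted in the lower group.
        sizes[bisect.bisect_left(cuts, v)] += 1
    return sizes


# Synthetic data (an assumption made for illustration only).
rng = random.Random(3)
data = [rng.gauss(0.0, 1.0) for _ in range(1000)]

quartiles = cut_points(data, 4)       # 3 cut points, 4 groups
percentiles = cut_points(data, 100)   # 99 cut points, 100 groups

print("quartile cut points:", [round(q, 3) for q in quartiles])
print("observations per quartile group:", group_sizes(data, quartiles))
print("number of percentile cut points:", len(percentiles))
```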

The following figure shows a normal probability plot for the daily returns to the Standard and Poor's stock market index.

Normal probability plot of daily returns to the S&P 500 in 2013.

The plot shows that the returns to the S&P 500 are close to being normal, with deviations in the tails of the distribution.
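The points of a normal probability plot can be computed with the standard library alone. The following is a hedged sketch rather than the method used to produce the figure: it pairs each sorted observation with a standard normal quantile and fits a least-squares line through the points. The function names, the plotting positions (i - 0.5) / n (one common convention among several), and the synthetic data are assumptions for illustration.

```python
import random
import statistics


def normal_probability_points(values):
    """Return (normal_quantile, sample_quantile) pairs, one per observation."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = statistics.NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def fit_line(points):
    """Least-squares slope and intercept of the line through (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx


# Synthetic stand-in for daily returns (an assumption, NOT real S&P 500 data).
rng = random.Random(4)
returns = [rng.gauss(0.0005, 0.01) for _ in range(250)]

# Standardize so that a normal dataset should lie near a line with slope 1.
mean, sd = statistics.fmean(returns), statistics.stdev(returns)
z_scores = [(r - mean) / sd for r in returns]

slope, intercept = fit_line(normal_probability_points(z_scores))
print(f"slope={slope:.3f}  intercept={intercept:.3f}")
```

For unstandardized data the fitted slope approximates the standard deviation and the intercept approximates the mean. Points that bend away from the line at either end indicate tails that are heavier or lighter than those of the normal distribution.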

About This Article

About the book author:

Alan Anderson, PhD is a teacher of finance, economics, statistics, and math at Fordham and Fairfield universities as well as at Manhattanville and Purchase colleges. Outside of the academic environment he has many years of experience working as an economist, risk manager, and fixed income analyst. Alan received his PhD in economics from Fordham University, and an M.S. in financial engineering from Polytechnic University.

David Semmelroth has two decades of experience translating customer data into actionable insights across the financial services, travel, and entertainment industries. David has consulted for Cedar Fair, Wachovia, National City, and TD Bank.

This article can be found in the category:

Big Data
