Statistics for Big Data For Dummies Cheat Sheet

Alan Anderson

David Semmelroth

Updated

2022-03-10 20:12:30

From the book

Statistics for Big Data For Dummies

Summary statistical measures represent the key properties of a sample or population as a single numerical value. This has the advantage of providing important information in a very compact form. It also simplifies comparing multiple samples or populations. Summary statistical measures can be divided into three types: measures of central tendency, measures of central dispersion, and measures of association.

Measures of central tendency

Measures of central tendency show the center of a data set. Three of the most commonly used measures of central tendency are the mean, median, and mode.

Mean

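Mean is another word for average. Here is the formula for computing the mean of a sample:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

where $X_i$ is the $i$-th element of the sample and $n$ is the number of elements in the sample.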

With this formula, you compute the sample mean by simply adding up all the elements in the sample and then dividing by the number of elements in the sample.

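Here is the corresponding formula for computing the mean of a population:

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

where $X_i$ is the $i$-th element of the population and $N$ is the number of elements in the population.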

Although the notation is slightly different, the procedure for computing a population mean is the same as the procedure for computing a sample mean.

Greek letters are used to describe populations, whereas Roman letters are used to describe samples.
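As a minimal sketch of the procedure described above, the following Python function adds up the elements and divides by their count. The function name `mean` and the data in `sample` are made up for illustration; the same code applies to a sample or a population, since only the notation differs.

```python
def mean(values):
    """Arithmetic mean: the sum of the elements divided by their count."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Hypothetical data, chosen only to illustrate the calculation.
sample = [2, 4, 4, 5, 10]
sample_mean = mean(sample)
print(sample_mean)  # 5.0
```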

Median

The median of a data set is a value that divides the data into two equal halves. In other words, half of the elements of a data set are less than the median, and the remaining half are greater than the median. The procedure for computing the median is the same for both samples and populations.
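One way to compute a median is sketched below. It sorts the data and takes the middle element; when the number of elements is even, it averages the two middle elements, which is a common convention assumed here rather than something the text above specifies. The function name `median` is illustrative.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if the count is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd count: a single middle element
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle elements


odd_median = median([7, 1, 3])      # sorted: 1, 3, 7
even_median = median([7, 1, 3, 5])  # sorted: 1, 3, 5, 7
print(odd_median, even_median)  # 3 4.0
```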

Mode

The mode of a data set is the most commonly observed value in the data set. You determine the mode in the same way for a sample and a population.
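A data set can have more than one most commonly observed value, so the sketch below returns every value that ties for the highest count. The function name `modes` and the decision to return a sorted list are assumptions made for this illustration.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency, in sorted order."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


single_mode = modes([1, 2, 2, 3, 3, 3])
tied_modes = modes([1, 1, 2, 2, 3])
print(single_mode)  # [3]
print(tied_modes)   # [1, 2]
```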

Measures of central dispersion

Measures of central dispersion show how “spread out” the elements of a data set are from the mean. Three of the most commonly used measures of central dispersion include the following:

Range
Variance
Standard deviation

Range

The range of a data set is the difference between the largest value and the smallest value. You compute it the same way for both samples and populations.
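In code, the range is a one-line calculation. The sketch below uses the hypothetical name `data_range` to avoid shadowing Python's built-in `range`.

```python
def data_range(values):
    """Difference between the largest and the smallest value."""
    if not values:
        raise ValueError("the range of an empty data set is undefined")
    return max(values) - min(values)


spread = data_range([2, 4, 4, 5, 10])
print(spread)  # 8
```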

Variance

You can think of the variance as the average squared difference between the elements of a data set and the mean. The formulas for computing a sample variance and a population variance are slightly different.

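Here is the formula for computing sample variance:

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

where $\bar{X}$ is the sample mean and $n$ is the number of elements in the sample.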

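And here is the formula for computing population variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

where $\mu$ is the population mean and $N$ is the number of elements in the population. Note that the sample variance divides by $n - 1$, whereas the population variance divides by $N$.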

Standard deviation

The standard deviation is simply the square root of the variance. It’s more commonly used as a measure of dispersion than the variance because it’s measured in the same units as the elements of the data set, whereas the variance is measured in squared units.
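The following sketch computes both versions of the variance and the corresponding standard deviation. It assumes the usual convention that a sample variance divides by the number of elements minus one while a population variance divides by the number of elements. The function names, the `population` flag, and the data are illustrative choices, not part of the original text.

```python
import math


def variance(values, population=False):
    """Average squared difference from the mean.

    Divides by n - 1 for a sample (the default) and by n for a population.
    """
    n = len(values)
    if n < (1 if population else 2):
        raise ValueError("not enough elements to compute the variance")
    center = sum(values) / n
    squared_diffs = sum((x - center) ** 2 for x in values)
    return squared_diffs / (n if population else n - 1)


def std_dev(values, population=False):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(values, population))


# Hypothetical data: the mean is 5 and the squared differences sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
population_variance = variance(data, population=True)  # 32 / 8
sample_variance = variance(data)                       # 32 / 7
print(population_variance)             # 4.0
print(std_dev(data, population=True))  # 2.0
```

The sample variance of the same data is slightly larger than the population variance because the divisor is smaller.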

Measures of association

Measures of association quantify the strength and the direction of the relationship between two data sets. Here are the two most commonly used measures of association:

Covariance
Correlation

Both measures show how closely two data sets are related to each other. The main difference between them is the units in which they are measured. Covariance is expressed in the product of the units of the two data sets, so its magnitude depends on the scale of the data. Correlation is unit-free and is defined to assume values between –1 and 1, which makes interpretation very easy.

Covariance

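The covariance between two samples is computed as follows:

$$s_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

where $\bar{X}$ and $\bar{Y}$ are the sample means and $n$ is the number of paired elements.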

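The covariance between two populations is computed as follows:

$$\sigma_{XY} = \frac{\sum_{i=1}^{N} (X_i - \mu_X)(Y_i - \mu_Y)}{N}$$

where $\mu_X$ and $\mu_Y$ are the population means and $N$ is the number of paired elements.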

Correlation

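The correlation between two samples is computed like this:

$$r_{XY} = \frac{s_{XY}}{s_X s_Y}$$

where $s_{XY}$ is the sample covariance and $s_X$ and $s_Y$ are the sample standard deviations.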

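The correlation between two populations is computed like this:

$$\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$$

where $\sigma_{XY}$ is the population covariance and $\sigma_X$ and $\sigma_Y$ are the population standard deviations.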
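A minimal sketch of both measures of association follows, assuming the standard definitions: covariance is the average product of paired deviations from the means, and correlation is the covariance divided by the product of the two standard deviations. The function names, the `population` flag, and the data are hypothetical.

```python
import math


def covariance(xs, ys, population=False):
    """Average product of paired deviations from the two means.

    Divides by n - 1 for samples (the default) and by n for populations.
    """
    if len(xs) != len(ys):
        raise ValueError("both data sets must have the same number of elements")
    n = len(xs)
    if n < (1 if population else 2):
        raise ValueError("not enough elements to compute the covariance")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n if population else n - 1)


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations.

    The divisor (n or n - 1) cancels, so the sample and population versions
    give the same number. Raises ZeroDivisionError if either data set is constant.
    """
    # covariance(xs, xs) is the variance of xs, so the square root of the
    # product of the two variances equals the product of the standard deviations.
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical data: ys is exactly twice xs, a perfect positive relationship.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
sample_cov = covariance(xs, ys)
perfect_positive = correlation(xs, ys)
perfect_negative = correlation(xs, list(reversed(ys)))
print(sample_cov)        # 5.0
print(perfect_positive)  # 1.0
```

Rescaling `ys` (for example, multiplying every element by 100) changes the covariance but leaves the correlation unchanged, which is why the correlation is easier to compare across data sets.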

About This Article

About the book author:

Alan Anderson, PhD is a teacher of finance, economics, statistics, and math at Fordham and Fairfield universities as well as at Manhattanville and Purchase colleges. Outside of the academic environment he has many years of experience working as an economist, risk manager, and fixed income analyst. Alan received his PhD in economics from Fordham University, and an M.S. in financial engineering from Polytechnic University.

David Semmelroth has two decades of experience translating customer data into actionable insights across the financial services, travel, and entertainment industries. David has consulted for Cedar Fair, Wachovia, National City, and TD Bank.

This article can be found in the category:

Big Data
