Summary statistical measures represent the key properties of a sample or population as a single numerical value. This has the advantage of providing important information in a very compact form. It also simplifies comparing multiple samples or populations. Summary statistical measures can be divided into three types: measures of central tendency, measures of central dispersion, and measures of association.
The two basic types of probability distributions are known as discrete and continuous. Discrete distributions describe the properties of a random variable for which every individual outcome is assigned a positive probability. Continuous distributions, by contrast, describe a random variable that can take on any value within a range; the probability of any single value is zero, so probabilities are assigned to intervals of values instead.
A random variable is actually a function; it assigns numerical values to the outcomes of a random process.
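As a small illustration (a minimal sketch in Python; the two-coin-toss setup is an invented example), the following defines a discrete random variable, the number of heads in two fair coin tosses, and builds the probability assigned to each of its values:

```python
# A discrete random variable as a function: it maps each outcome of a
# random process (two fair coin tosses) to a numerical value (number of heads).
from itertools import product

outcomes = list(product("HT", repeat=2))            # the sample space of the random process
num_heads = {o: o.count("H") for o in outcomes}     # the random variable: outcome -> number

# Probability mass function: each possible value gets a positive probability.
pmf = {}
for outcome, value in num_heads.items():
    pmf[value] = pmf.get(value, 0) + 1 / len(outcomes)

print(pmf)                 # {2: 0.25, 1: 0.5, 0: 0.25}
print(sum(pmf.values()))   # the probabilities of a discrete distribution sum to 1.0
```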
Hypothesis testing is a statistical technique that is used in a variety of situations. Though the technical details differ from situation to situation, all hypothesis tests use the same core set of terms and concepts. The following descriptions of common terms and concepts refer to a hypothesis test in which the means of two populations are being compared.
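For example, one standard test for comparing the means of two populations is the independent-samples t-test. Here is a minimal sketch using SciPy's stats.ttest_ind; the library choice and the sample values are assumptions made purely for illustration:

```python
# Two-sample hypothesis test: are the means of two populations equal?
# The null hypothesis is that they are; the samples below are invented.
from scipy import stats

sample_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
sample_b = [11.2, 11.5, 11.1, 11.6, 11.4, 11.3]

t_stat, p_value = stats.ttest_ind(sample_a, sample_b)

# At the usual 5 percent significance level, a p-value below 0.05 leads to
# rejecting the null hypothesis that the two population means are equal.
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```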
Several different types of graphs may be useful for analyzing data. These include stem-and-leaf plots, scatter plots, box plots, histograms, quantile-quantile (QQ) plots, and autocorrelation plots.
A stem-and-leaf plot splits each observation in a data set into a “stem” (the leading digit or digits, which define the categories) and a “leaf” (the trailing digit), so that every individual value in the data set remains visible in the plot.
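A minimal sketch in plain Python (the data values are invented) shows the idea, with the tens digit as the stem and the ones digit as the leaf:

```python
# Build a simple text stem-and-leaf plot: stems are tens digits, leaves are ones digits.
from collections import defaultdict

data = [23, 25, 27, 31, 31, 34, 38, 42, 45, 45, 49, 51]

stems = defaultdict(list)
for value in sorted(data):
    stems[value // 10].append(value % 10)

for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")
# 2 | 3 5 7
# 3 | 1 1 4 8
# 4 | 2 5 5 9
# 5 | 1
```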
One important way to draw conclusions about the properties of a population is with hypothesis testing. You can use hypothesis tests to compare a population measure to a specified value, compare measures for two populations, determine whether a population follows a specified probability distribution, and so forth.
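For the first of those use cases, comparing a population mean to a specified value, a one-sample t-test is a common choice. The sketch below uses SciPy's stats.ttest_1samp; the sample data and the hypothesized mean of 12.0 are invented for illustration:

```python
# One-sample hypothesis test: is the population mean equal to a specified value?
from scipy import stats

sample = [11.8, 12.3, 12.1, 11.9, 12.4, 12.2, 12.0, 11.7]

t_stat, p_value = stats.ttest_1samp(sample, popmean=12.0)

# A large p-value means there is no evidence that the mean differs from 12.0.
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```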
Measures of association quantify the strength and the direction of the relationship between two data sets. Here are the two most commonly used measures of association:
Covariance
Correlation
Both measures are used to show how closely two data sets are related to each other. The main difference between them is the units in which they are measured: covariance is expressed in the product of the units of the two data sets, whereas correlation is unitless and always falls between -1 and +1.
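The contrast is easy to see numerically. The following minimal sketch uses NumPy (an assumed library choice, with made-up paired data) to compute both measures for the same pair of data sets:

```python
# Covariance vs. correlation for the same two data sets.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

covariance = np.cov(x, y)[0, 1]          # expressed in the product of x's and y's units
correlation = np.corrcoef(x, y)[0, 1]    # unitless, always between -1 and +1

print(covariance)    # positive: x and y tend to move together
print(correlation)   # close to +1: a strong positive relationship
```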
Measures of central tendency show the center of a data set. Three of the most commonly used measures of central tendency are the mean, median, and mode.
Mean
Mean is another word for average. Here is the formula for computing the mean of a sample:

x̄ = (x₁ + x₂ + ⋯ + xₙ) / n

With this formula, you compute the sample mean by simply adding up all the elements in the sample and then dividing by the number of elements in the sample (n).
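In code, the calculation is a one-liner. Here is a minimal sketch in Python with invented sample values:

```python
# Sample mean: sum the elements, then divide by the number of elements.
sample = [4.0, 7.0, 5.5, 6.5, 7.0]

mean = sum(sample) / len(sample)
print(mean)   # 6.0

# The standard library's statistics module gives the same result.
import statistics
print(statistics.mean(sample))   # 6.0
```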
Measures of central dispersion show how "spread out" the elements of a data set are from the mean. The three most commonly used measures of central dispersion are the following:
Range
Variance
Standard deviation
Range
The range of a data set is the difference between the largest value and the smallest value.
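As a quick sketch, here is how the range, along with the other two measures listed above, might be computed for a small invented sample using Python's standard statistics module:

```python
# Range, sample variance, and sample standard deviation for one small data set.
import statistics

sample = [4.0, 7.0, 5.5, 6.5, 7.0]

data_range = max(sample) - min(sample)    # largest value minus smallest value
variance = statistics.variance(sample)    # sample variance (divides by n - 1)
std_dev = statistics.stdev(sample)        # square root of the sample variance

print(data_range, variance, std_dev)
```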
A box plot is designed to show several key statistics for a dataset in the form of a vertical rectangle or box. The statistics it can show include the following:
Minimum value
Maximum value
First quartile (Q1)
Second quartile (Q2)
Third quartile (Q3)
Interquartile range (IQR)
The first quartile of a dataset is a numerical measure that divides the data into two parts: the smallest 25 percent of the observations and the largest 75 percent of the observations.
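A minimal sketch using NumPy's percentile function (the data values are invented) computes every statistic listed above:

```python
# The key box-plot statistics for a small data set.
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

q1, q2, q3 = np.percentile(data, [25, 50, 75])   # first, second (median), third quartiles
iqr = q3 - q1                                    # interquartile range: middle 50 percent of the data

print(data.min(), q1, q2, q3, data.max(), iqr)
```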
For a dataset that consists of observations taken at different points in time (that is, time series data), it's important to determine whether the observations are correlated with each other. This matters because many techniques for modeling time series data are based on the assumption that the observations are uncorrelated with each other (independent).
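One simple check is to compute the correlation between the series and a lagged copy of itself (the autocorrelation). Here is a minimal sketch using pandas; the library choice and the made-up upward-trending series are assumptions for illustration:

```python
# Lag-1 autocorrelation of a time series: correlate the series with itself shifted by one period.
import pandas as pd

series = pd.Series([10.0, 10.4, 10.9, 11.3, 11.8, 12.1, 12.7, 13.0, 13.6, 14.1])

lag1 = series.autocorr(lag=1)
print(lag1)   # close to +1 here: successive observations are clearly not independent
```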