By Deborah J. Rumsey, David Unger

Formulas — you just can’t get away from them when you’re studying statistics. Here are ten statistical formulas you’ll use frequently and the steps for calculating them.

Proportion

Some variables are categorical and identify which category or group an individual belongs to. For example, “relationship status” is a categorical variable, and an individual could be single, dating, married, divorced, and so on.

The actual number of individuals in any given category is called the frequency for that category. A proportion, or relative frequency, is the fraction of individuals that fall into a category (multiply it by 100 to express it as a percentage). The proportion of a given category, denoted by p, is the frequency divided by the total sample size.

So to calculate the proportion, you

  1. Count up all the individuals in the sample who fall into the specified category.

  2. Divide by n, the number of individuals in the sample.
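The two steps above can be sketched in Python. The sample data and the function name `proportion` are made up for illustration; they are not part of the original text.

```python
from collections import Counter

# Hypothetical sample of relationship statuses (illustrative data only).
statuses = ["single", "married", "dating", "single", "married",
            "single", "divorced", "married", "single", "dating"]

def proportion(sample, category):
    """Return the frequency of `category` divided by the sample size n."""
    frequency = Counter(sample)[category]  # Step 1: count individuals in the category
    n = len(sample)                        # n, the number of individuals in the sample
    return frequency / n                   # Step 2: divide by n

p_single = proportion(statuses, "single")
print(p_single)  # 4 of the 10 individuals are single, so this prints 0.4
```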

Mean

The mean, or the average of a data set, is one way to measure the center of a numerical data set. The notation for the mean is $\bar{x}$.

The formula for the mean is

$$\bar{x} = \frac{\sum x}{n}$$

where x represents each of the values in the data set and n is the number of values.

To calculate the mean, you

  1. Add up all the numbers in the data set.

  2. Divide by n, the number of values in the data set.
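As a minimal sketch of these two steps, assuming a small made-up data set (the name `mean` and the values are illustrative):

```python
# Hypothetical data set (illustrative values only).
data = [2, 4, 4, 5, 7, 8]

def mean(values):
    """Add up all the values, then divide by how many there are."""
    total = sum(values)   # Step 1: add up all the numbers
    n = len(values)       # n, the number of values in the data set
    return total / n      # Step 2: divide by n

x_bar = mean(data)
print(x_bar)  # the values sum to 30 and n is 6, so this prints 5.0
```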

Median

The median of a numerical data set is another way to measure the center. The median is the middle value after you order the data from smallest to largest.

To calculate the median, go through the following steps:

  1. Order the numbers from smallest to largest.

  2. For an odd number of values, choose the one that falls exactly in the middle. You’ve pinpointed the median.

  3. For an even number of values, take the two values exactly in the middle and average them to find the median.
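The three steps translate fairly directly into code. This is a sketch with an illustrative function name, not a replacement for a library routine such as Python's `statistics.median`.

```python
def median(values):
    """Return the middle value of the data after sorting."""
    ordered = sorted(values)   # Step 1: order from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Step 2: odd count, so take the value exactly in the middle
        return ordered[mid]
    # Step 3: even count, so average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 3]))     # sorted: 1, 3, 7
print(median([4, 1, 3, 2]))  # sorted: 1, 2, 3, 4
```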

Sample standard deviation

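The standard deviation of a sample is a measure of the amount of variability in the sample. You can think of it, in general terms, as the average distance from the mean. The formula for the standard deviation is

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

where $\bar{x}$ is the sample mean and n is the number of values in the sample.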

To calculate the standard deviation, you

  1. Find the average of all the numbers, $\bar{x}$.

  2. Take each number and subtract the average from it.

  3. Square each of the resulting values.

  4. Add them all up.

  5. Divide by n – 1.

  6. Take the square root.
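Here is one way the six steps might look in Python, using a made-up data set. The function name `sample_std` is illustrative.

```python
import math

def sample_std(values):
    """Sample standard deviation, following the six steps in order."""
    n = len(values)
    x_bar = sum(values) / n                    # Step 1: find the average
    deviations = [x - x_bar for x in values]   # Step 2: subtract the average
    squared = [d ** 2 for d in deviations]     # Step 3: square each result
    total = sum(squared)                       # Step 4: add them all up
    variance = total / (n - 1)                 # Step 5: divide by n - 1
    return math.sqrt(variance)                 # Step 6: take the square root

# Hypothetical data set (illustrative values only).
data = [2, 4, 4, 5, 7, 8]
s = sample_std(data)
print(round(s, 4))
```

For this data the squared deviations sum to 24, so the result is the square root of 24 / 5 = 4.8.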

Percentile

Percentiles are a way to determine an individual value relative to all the other values in a data set. When taking a standardized test, you get an individual raw score and a percentile. If you come in at the 90th percentile, for example, 90 percent of the test scores of all students are the same as or below yours (and 10 percent are above yours). In general, being at the kth percentile means k percent of the data lie at or below that point and (100 – k) percent lie above it.

To calculate a percentile for a value from an approximately normal distribution, you

  1. Convert the original value to a standard score by using the z-formula,

    $$z = \frac{x - \mu}{\sigma}$$

    where x is the original value, $\mu$ is the population mean of all values, and $\sigma$ is the population standard deviation of all values.

  2. Use the Z-table to find the corresponding percentile for the standard score.
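A short sketch of both steps follows. It swaps the printed Z-table for the standard normal CDF from Python's `statistics.NormalDist`, which assumes the values are approximately normally distributed. The test-score numbers are hypothetical.

```python
from statistics import NormalDist

def percentile_from_value(x, mu, sigma):
    """Percentile of x in a normal population with mean mu and std dev sigma."""
    z = (x - mu) / sigma              # Step 1: convert to a standard score
    return NormalDist().cdf(z) * 100  # Step 2: CDF lookup stands in for the Z-table

# Hypothetical test: population mean 500, standard deviation 100, raw score 600.
pct = percentile_from_value(600, mu=500, sigma=100)
print(round(pct, 1))  # z = 1, roughly the 84th percentile
```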

Margin of error for the sample mean

The margin of error for your sample mean, $\bar{x}$, is the amount you expect the sample mean to vary from sample to sample. The formula for the margin of error for $\bar{x}$, when dealing with samples of size 30 or more, is

$$\text{MOE} = z^* \frac{\sigma}{\sqrt{n}}$$

where z* is the standard normal value for the confidence level you want, $\sigma$ is the standard deviation, and n is the sample size.

To calculate the margin of error for $\bar{x}$, you

  1. Determine the confidence level and find the appropriate z*.

  2. Find the standard deviation, $\sigma$, and the sample size, n.

  3. Multiply z* by $\sigma$ divided by the square root of n.
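The steps above could be sketched as follows. The sketch assumes a two-sided confidence level, so z* is the standard normal quantile at 1 - (1 - confidence) / 2. The function name and the numbers are illustrative.

```python
import math
from statistics import NormalDist

def margin_of_error(sigma, n, confidence=0.95):
    """Margin of error for the sample mean (intended for n of 30 or more)."""
    # Step 1: find z* for the chosen (two-sided) confidence level
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Steps 2 and 3: multiply z* by sigma divided by the square root of n
    return z_star * sigma / math.sqrt(n)

# Hypothetical inputs: standard deviation 15, sample size 100, 95% confidence.
moe = margin_of_error(sigma=15, n=100)
print(round(moe, 2))  # z* is about 1.96, so roughly 1.96 * 15 / 10
```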

Sample size needed

If you want to calculate a confidence interval for the population mean with a certain margin of error, you can figure out the sample size you need before you collect any data. The formula for the sample size for $\bar{x}$ is

$$n = \left(\frac{z^* \sigma}{\text{MOE}}\right)^2$$

where z* is the standard normal value for the confidence level, MOE is your desired margin of error, and $\sigma$ is the standard deviation. Because $\sigma$ is an unknown value that you need, you may have to do a pilot study (small experimental study) to come up with a guess for the value of the standard deviation.

To calculate the sample size for $\bar{x}$, run through the following steps:

  1. Multiply z* times the standard deviation ($\sigma$, or your estimate of it, such as s from a pilot study).

  2. Divide by the desired margin of error, MOE.

  3. Square it.

  4. Round any fractional amount up to the nearest integer (so you achieve your desired MOE or better).
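A minimal sketch of the four steps, again assuming a two-sided confidence level. The standard deviation here is a hypothetical pilot-study guess, and the function name is illustrative.

```python
import math
from statistics import NormalDist

def sample_size(sigma_guess, moe, confidence=0.95):
    """Smallest n that achieves the desired margin of error or better."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    product = z_star * sigma_guess   # Step 1: multiply z* by the standard deviation
    ratio = product / moe            # Step 2: divide by the desired MOE
    n = ratio ** 2                   # Step 3: square it
    return math.ceil(n)              # Step 4: round any fraction up

# Hypothetical inputs: guessed standard deviation 15, desired MOE of 2, 95% confidence.
n_needed = sample_size(sigma_guess=15, moe=2)
print(n_needed)  # the unrounded value is about 216.08, so this prints 217
```

Rounding up rather than to the nearest integer matters: 216 observations would leave the margin of error slightly above the target.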

Test statistic for the mean

When conducting a hypothesis test for the population mean, you take the sample mean and find out how far it is from the claimed value in terms of a standard score. The standard score is called the test statistic. The formula for the test statistic for the mean is

$$z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where $\mu_0$ is the claimed value for the population mean (the value that sits in the null hypothesis).

To calculate the test statistic for the sample mean for samples of size 30 or more, you

  1. Calculate the sample mean, $\bar{x}$, and the sample standard deviation, s.

  2. Take $\bar{x} - \mu_0$, the sample mean minus the claimed value.

  3. Calculate the standard error, $s / \sqrt{n}$.

  4. Divide your result from Step 2 by the standard error found in Step 3.
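As a sketch, the calculation can be written as a function of the summary statistics. It assumes Step 1 has already been done, so the sample mean and sample standard deviation are passed in directly. All names and numbers are hypothetical.

```python
import math

def z_test_statistic(x_bar, s, n, mu_0):
    """Test statistic for the mean (intended for samples of size 30 or more)."""
    difference = x_bar - mu_0            # Step 2: sample mean minus claimed value
    standard_error = s / math.sqrt(n)    # Step 3: standard error
    return difference / standard_error   # Step 4: divide Step 2 by Step 3

# Hypothetical summary statistics: x_bar = 52, s = 10, n = 100, claimed mean 50.
z = z_test_statistic(x_bar=52, s=10, n=100, mu_0=50)
print(z)  # the standard error is 1, so the sample mean sits 2 standard errors above 50
```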

Correlation

Sample correlation is a measure of the strength and direction of the linear relationship between two quantitative variables X and Y. It doesn’t measure any other type of relationship, and it doesn’t apply to categorical variables. The formula for correlation is

$$r = \frac{1}{n - 1} \sum \frac{(x - \bar{x})(y - \bar{y})}{s_x s_y}$$

To calculate the correlation, you

  1. Find the mean of all the x values and call it $\bar{x}$. Find the mean of all the y values and call it $\bar{y}$.

  2. Find the standard deviation of all the x values and call it $s_x$. Find the standard deviation of all the y values and call it $s_y$.

  3. For each (x, y) pair in the data set, take x minus $\bar{x}$ and y minus $\bar{y}$, and multiply them together.

  4. Add all these products together to get a sum.

  5. Divide the sum by $s_x \times s_y$.

  6. Divide the result by n – 1 where n is the number of (x, y) pairs. (This is the same as multiplying by one over n – 1.)
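The six steps might be coded like this. The data pairs are made up for illustration, and the helper name `correlation` is not from the original text.

```python
import math

def correlation(xs, ys):
    """Sample correlation r between paired lists xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n                                            # Step 1: means
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))   # Step 2: std devs
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Steps 3 and 4: multiply the paired deviations and add the products
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Steps 5 and 6: divide by s_x times s_y, then by n - 1
    return total / (s_x * s_y) / (n - 1)

# Hypothetical (x, y) pairs.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 4))
```

A quick sanity check is that points lying exactly on an upward-sloping line should give a value of 1 (up to floating-point rounding).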

Regression line

After examining a scatterplot of two numerical variables and calculating the sample correlation between them, you might observe a linear relationship. In that case, it would be appropriate to fit a regression line for estimating the value of the response variable (Y) given a value for the explanatory variable (X).

Before calculating the regression line, you need five summary statistics:

  • The mean of the x values, $\bar{x}$

  • The mean of the y values, $\bar{y}$

  • The standard deviation of the x values (denoted $s_x$)

  • The standard deviation of the y values (denoted $s_y$)

  • The correlation between X and Y (denoted r)

So, to calculate the best-fit regression line, you

  1. Find the slope using the formula

    $$m = r \left(\frac{s_y}{s_x}\right)$$

  2. Find the y-intercept using the formula

    $$b = \bar{y} - m\bar{x}$$

  3. Piece together the results from Steps 1 and 2 to give you the regression line: y = mx + b.
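A sketch of these three steps, taking the five summary statistics as inputs. It uses the standard least-squares relationships, slope m = r(s_y / s_x) and intercept b = y-bar minus m times x-bar. The numbers and function name are hypothetical.

```python
def regression_line(x_bar, y_bar, s_x, s_y, r):
    """Return slope m and intercept b of the best-fit line y = mx + b."""
    m = r * (s_y / s_x)    # Step 1: slope
    b = y_bar - m * x_bar  # Step 2: y-intercept
    return m, b            # Step 3: together these define y = mx + b

# Hypothetical summary statistics.
m, b = regression_line(x_bar=3, y_bar=4, s_x=2, s_y=4, r=0.5)
print(m, b)  # slope 0.5 * (4 / 2) = 1.0, intercept 4 - 1.0 * 3 = 1.0

# Estimate the response at x = 5 using the fitted line.
y_hat = m * 5 + b
print(y_hat)
```

Note that the fitted line always passes through the point of means, which makes a convenient check on the arithmetic.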