Statistics Articles
Ace your stats class, analyze data for work, or play the odds at the slot machines. Everything you need is in here.
Articles From Statistics
Article / Updated 03-15-2022
If you know the standard deviation for a population, then you can calculate a confidence interval (CI) for the mean, or average, of that population. When a statistical characteristic that’s being measured (such as income, IQ, price, height, quantity, or weight) is numerical, most people want to estimate the mean (average) value for the population. You estimate the population mean, μ, by using a sample mean, x̄, plus or minus a margin of error. The result is called a confidence interval for the population mean, μ.

When the population standard deviation is known, the formula for a confidence interval (CI) for a population mean is x̄ ± z* σ/√n, where x̄ is the sample mean, σ is the population standard deviation, n is the sample size, and z* represents the appropriate z*-value from the standard normal distribution for your desired confidence level.

z*-values for Various Confidence Levels

Confidence Level    z*-value
80%                 1.28
90%                 1.645 (by convention)
95%                 1.96
98%                 2.33
99%                 2.58

The above table shows values of z* for the given confidence levels. Note that these values are taken from the standard normal (Z-) distribution. The area between each z* value and the negative of that z* value is (approximately) the confidence percentage. For example, the area between z = –1.28 and z = 1.28 is approximately 0.80. Hence this table can be expanded to other confidence percentages as well; it shows only the confidence percentages most commonly used.

To use this formula, the data either have to come from a normal distribution, or if not, then n has to be large enough (at least 30 or so) for the Central Limit Theorem to be applied, allowing you to use z*-values in the formula.

To calculate a CI for the population mean (average), under these conditions, do the following:

1. Determine the confidence level and find the appropriate z*-value. Refer to the above table.
2. Find the sample mean (x̄) for the sample size (n). Note: The population standard deviation is assumed to be a known value, σ.
3. Multiply z* times σ and divide that by the square root of n. This calculation gives you the margin of error.
4. Take x̄ plus or minus the margin of error to obtain the CI. The lower end of the CI is x̄ minus the margin of error, whereas the upper end of the CI is x̄ plus the margin of error.

For example, suppose you work for the Department of Natural Resources and you want to estimate, with 95 percent confidence, the mean (average) length of all walleye fingerlings in a fish hatchery pond.

1. Because you want a 95 percent confidence interval, your z*-value is 1.96.
2. Suppose you take a random sample of 100 fingerlings and determine that the average length is 7.5 inches; assume the population standard deviation is 2.3 inches. This means x̄ = 7.5, σ = 2.3, and n = 100.
3. Multiply 1.96 times 2.3 divided by the square root of 100 (which is 10). The margin of error is, therefore, 1.96 × (2.3/10) = 1.96 × 0.23 ≈ 0.45 inches.
4. Your 95 percent confidence interval for the mean length of walleye fingerlings in this fish hatchery pond is 7.5 inches ± 0.45 inches. (The lower end of the interval is 7.5 – 0.45 = 7.05 inches; the upper end is 7.5 + 0.45 = 7.95 inches.)

After you calculate a confidence interval, make sure you always interpret it in words a non-statistician would understand. That is, talk about the results in terms of what the person in the problem is trying to find out; statisticians call this interpreting the results “in the context of the problem.” In this example you can say: “With 95 percent confidence, the average length of walleye fingerlings in this entire fish hatchery pond is between 7.05 and 7.95 inches, based on my sample data.” (Always be sure to include appropriate units.)
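If you would rather let a computer do the arithmetic, the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name z_confidence_interval is made up for this example, and it looks up z* with the standard library's NormalDist instead of the rounded table value, so results can differ slightly in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper, margin) for a mean when sigma is known.

    Hypothetical helper for illustration; assumes normal data or a
    sample large enough (about 30 or more) for the Central Limit Theorem.
    """
    # z* leaves (1 - confidence) / 2 in each tail of the Z-distribution.
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Margin of error: z* times sigma, divided by the square root of n.
    margin = z_star * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin, margin


# Walleye fingerling example: x-bar = 7.5, sigma = 2.3, n = 100.
lower, upper, margin = z_confidence_interval(7.5, 2.3, 100, confidence=0.95)
print(f"margin of error: {margin:.2f} inches")
print(f"95% CI: {lower:.2f} to {upper:.2f} inches")
```

Rounded to two decimal places, this reproduces the margin of error of 0.45 inches and the interval from 7.05 to 7.95 inches worked out by hand.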
Cheat Sheet / Updated 02-25-2022
This cheat sheet is for you to use as a quick resource for finding important basic statistical formulas such as mean, standard deviation, and Z-values; important and always useful probability definitions such as independence and rules such as the multiplication rule and the addition rule; and 10 quick ways to spot statistical mistakes either in your own work, or out there in the media as a consumer of statistical information.
Cheat Sheet / Updated 02-23-2022
Statistics II elaborates on Statistics I and moves into new territories, including multiple regression, analysis of variance (ANOVA), Chi-square tests, nonparametric procedures, and other key topics. Knowing which data analysis to use and why is important, as is familiarity with computer output if you want your numbers to give you dependable results.
Cheat Sheet / Updated 02-14-2022
SPSS is an application that performs statistical analysis on data. Entering and manipulating information in the application can be done by using SPSS’s proprietary language, which is known as the Syntax command language, or more commonly, as Syntax. The language is quite like other programming languages, and it allows you to define variables (or use predefined ones), and to use them within statements, or to evaluate them with relational or logical operators. Good programmers always know to make their code accessible through the use of comments. Syntax can also be used in conjunction with Basic and Python.
Cheat Sheet / Updated 01-28-2022
Statistics problems come in many types, ranging from pie charts, bar graphs, means, and standard deviation to correlation, regression, confidence intervals, and hypothesis tests. To be successful, you need to be able to make connections between statistical ideas and statistical formulas. Through practice, you see what type of technique is required for a problem and why, as well as how to set up the problem, work it out, and make proper conclusions. Most statistics problems you encounter likely involve terminology, symbols, and formulas. No worries! This Cheat Sheet gives you tips for success.
Article / Updated 12-28-2021
In statistics, you can easily find probabilities for a sample mean if it has a normal distribution. Even if it doesn’t have a normal distribution, or the distribution is not known, you can find probabilities if the sample size, n, is large enough.

The normal distribution is a very friendly distribution that has a table for finding probabilities and anything else you need. For example, you can find probabilities for x̄ by converting x̄ to a z-value and finding probabilities using the Z-table. The general conversion formula from x-values to z-values is z = (x – μ)/σ. Substituting the appropriate values of the mean and standard error of x̄, the conversion formula becomes z = (x̄ – μ)/(σ/√n).

Don’t forget to divide by the square root of n in the denominator of z. Always divide by the square root of n when the question refers to the average of the x-values.

For example, suppose X is the time it takes a randomly chosen clerical worker in an office to type and send a standard letter of recommendation. Suppose X has a normal distribution, and assume the mean is 10.5 minutes and the standard deviation 3 minutes. You take a random sample of 50 clerical workers and measure their times. What is the chance that their average time is less than 9.5 minutes?

This question translates to finding P(x̄ < 9.5). As X has a normal distribution to start with, you know x̄ also has an exact (not approximate) normal distribution. Converting to z, you get z = (9.5 – 10.5)/(3/√50) ≈ –2.36. So you want P(Z < –2.36). Using the Z-table, you find that P(Z < –2.36) = 0.0091. So the probability that a random sample of 50 clerical workers averages less than 9.5 minutes to complete this task is 0.91% (very small).

How do you find probabilities for x̄ if X is not normal, or unknown? As a result of the Central Limit Theorem (CLT), the distribution of X can be non-normal or even unknown, and as long as n is large enough, you can still find approximate probabilities for x̄ using the standard normal (Z-) distribution and the process described above. That is, convert x̄ to a z-value and find approximate probabilities using the Z-table.

When you use the CLT to find a probability for x̄ (that is, when the distribution of X is not normal or is unknown), be sure to say that your answer is an approximation. You also want to say the approximate answer should be close because you’ve got a large enough n to use the CLT. (If n is not large enough for the CLT, you can use the t-distribution in many cases.)
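As a quick check on the clerical-worker example, here is a minimal Python sketch of the same conversion. The function name prob_sample_mean_below is an assumption made for this illustration, and the code uses the exact z-value rather than one rounded to two decimals, so the probability differs very slightly from the Z-table answer.

```python
from math import sqrt
from statistics import NormalDist


def prob_sample_mean_below(x_bar, mu, sigma, n):
    """Return (z, P(sample mean < x_bar)) using the Z-distribution.

    Hypothetical helper for illustration. The result is exact when X is
    normal and only approximate (via the CLT) otherwise.
    """
    standard_error = sigma / sqrt(n)   # divide sigma by the square root of n
    z = (x_bar - mu) / standard_error  # convert the sample mean to a z-value
    return z, NormalDist().cdf(z)      # "less than" probability for that z


# Mean 10.5 minutes, standard deviation 3 minutes, sample of 50 workers.
z, p = prob_sample_mean_below(9.5, mu=10.5, sigma=3, n=50)
print(f"z = {z:.2f}, probability = {p:.4f}")
```

The z-value rounds to –2.36, matching the hand calculation, and the probability comes out at roughly 0.009, in line with the table value of 0.0091.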
Article / Updated 12-21-2021
Statistical researchers often use a linear relationship to predict the (average) numerical value of Y for a given value of X using a straight line (called the regression line). If you know the slope and the y-intercept of that regression line, then you can plug in a value for X and predict the average value for Y. In other words, you predict (the average) Y from X.

If you establish at least a moderate correlation between X and Y through both a correlation coefficient and a scatterplot, then you know they have some type of linear relationship. Never do a regression analysis unless you have already found at least a moderately strong correlation between the two variables. (A good rule of thumb is that the correlation should be at or beyond 0.50 in either the positive or negative direction.) If the data don’t resemble a line to begin with, you shouldn’t try to use a line to fit the data and make predictions (but people still try).

Before moving forward to find the equation for your regression line, you have to identify which of your two variables is X and which is Y. When doing correlations, the choice of which variable is X and which is Y doesn’t matter, as long as you’re consistent for all the data. But when fitting lines and making predictions, the choice of X and Y does make a difference.

So how do you determine which variable is which? In general, Y is the variable that you want to predict, and X is the variable you are using to make that prediction. For example, say you are using the number of times a population of crickets chirps to predict the temperature. In this case you would make the variable Y the temperature, and the variable X the number of chirps. Hence Y can be predicted by X using the equation of a line if a strong enough linear relationship exists.

Statisticians call the X-variable (cricket chirps in this example) the explanatory variable, because if X changes, the slope tells you (or explains) how much Y is expected to change in response. Therefore, the Y variable is called the response variable. Other names for X and Y include the independent and dependent variables, respectively.

In the case of two numerical variables, you can come up with a line that enables you to predict Y from X, if (and only if) the following two conditions are met:

1. The scatterplot must form a linear pattern.
2. The correlation, r, is moderate to strong (typically beyond 0.50 or –0.50).

Some researchers actually don’t check these conditions before making predictions. Their claims are not valid unless the two conditions are met.

But suppose the correlation is high; do you still need to look at the scatterplot? Yes. In some situations the data have a somewhat curved shape, yet the correlation is still strong; in these cases making predictions using a straight line is still invalid. Predictions in these cases need to be made based on other methods that use a curve instead.
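To see how the correlation check and the prediction fit together, here is a rough Python sketch. Everything specific in it is an assumption for illustration: the function names fit_line and predict are made up, the line is fit by ordinary least squares, and the cricket data are invented numbers, not real measurements. Note that the code can enforce only the correlation condition; you still have to look at the scatterplot yourself.

```python
from math import sqrt


def fit_line(xs, ys):
    """Return (r, slope, intercept) for a least-squares line predicting Y from X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)            # correlation coefficient
    slope = sxy / sxx                    # expected change in Y per unit of X
    intercept = mean_y - slope * mean_x  # value of the line at X = 0
    return r, slope, intercept


def predict(xs, ys, x_new, min_abs_r=0.50):
    """Predict the average Y at x_new, refusing when the correlation is weak."""
    r, slope, intercept = fit_line(xs, ys)
    if abs(r) < min_abs_r:
        raise ValueError(f"correlation r = {r:.2f} is too weak for a line")
    return intercept + slope * x_new


# Made-up data: X = cricket chirps counted, Y = temperature in degrees F.
chirps = [18, 20, 21, 23, 27, 30, 34, 39]
temps = [57, 60, 64, 65, 68, 71, 74, 77]

r, slope, intercept = fit_line(chirps, temps)
print(f"r = {r:.2f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"predicted temperature at 25 chirps: {predict(chirps, temps, 25):.1f}")
```

Swapping the roles of chirps and temps would give a different line, which is why deciding which variable is X and which is Y comes first.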
Article / Updated 12-21-2021
One of the features that a histogram can show you is the shape of the statistical data; in other words, the manner in which the data fall into groups. For example, all the data may be exactly the same, in which case the histogram is just one tall bar; or the data might have an equal number in each group, in which case the shape is flat.

Some data sets have a distinct shape. Here are three shapes that stand out:

Symmetric. A histogram is symmetric if you cut it down the middle and the left-hand and right-hand sides resemble mirror images of each other.

Skewed right. A skewed right histogram looks like a lopsided mound, with a tail going off to the right.

Skewed left. If a histogram is skewed left, it looks like a lopsided mound with a tail going off to the left.

Following are some particulars about classifying the shape of a data set:

Don't expect symmetric data to have an exact and perfect shape. Data hardly ever fall into perfect patterns, so you have to decide whether the data shape is close enough to be called symmetric. If the differences aren't significant enough, you can classify it as symmetric or roughly symmetric. Otherwise, you classify the data as non-symmetric.

Don't assume that data are skewed if the shape is non-symmetric. Data sets come in all shapes and sizes, and many of them don't have a distinct shape at all. Skewness is mentioned here because it's one of the more common non-symmetric shapes, and it's one of the shapes included in a standard introductory statistics course.

If a data set does turn out to be skewed (or close to it), make sure to denote the direction of the skewness (left or right).
Article / Updated 10-27-2021
The t-table (for the t-distribution) is different from the z-table (for the z-distribution). Make sure you understand the values in the first and last rows. Finding probabilities for various t-distributions, using the t-table, is a valuable statistics skill. Use the t-table as necessary to solve the following sample problems.

Sample questions

1. For a study involving one population and a sample size of 18 (assuming you have a t-distribution), what row of the t-table will you use to find the right-tail (“greater than”) probability affiliated with the study results?

Answer: df = 17

The study involving one population and a sample size of 18 has n – 1 = 18 – 1 = 17 degrees of freedom.

2. For a study involving a paired design with a total of 44 observations, with the results assuming a t-distribution, what row of the table will you use to find the probability affiliated with the study results?

Answer: df = 21

A matched-pairs design with 44 total observations has 22 pairs. The degrees of freedom is one less than the number of pairs: n – 1 = 22 – 1 = 21.

3. A t-value of 2.35, from a t-distribution with 14 degrees of freedom, has an upper-tail (“greater than”) probability between which two values on the t-table?

Answer: 0.025 and 0.01

Using the t-table, locate the row with 14 degrees of freedom and look for 2.35. This exact value doesn’t appear in that row, so look for the values on either side of it: 2.14479 and 2.62449. The upper-tail probabilities appear in the column headings; the column heading for 2.14479 is 0.025, and the column heading for 2.62449 is 0.01. Hence, the upper-tail probability for a t-value of 2.35 must lie between 0.025 and 0.01.
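If you have no t-table handy, the sample answers above can be checked with a short Python sketch. This is one possible approach under stated assumptions, not the method the t-table itself is built from: the function names are made up for this example, and the upper-tail probability is approximated by numerically integrating the t-distribution's density with Simpson's rule, because Python's standard library has no built-in t-distribution.

```python
import math


def df_one_sample(n):
    """Degrees of freedom for one population: sample size minus 1."""
    return n - 1


def df_paired(total_observations):
    """Degrees of freedom for a paired design: number of pairs minus 1."""
    return total_observations // 2 - 1


def t_pdf(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) with Simpson's rule (steps must be even)."""
    if t < 0:
        # By symmetry, P(T > t) = 1 - P(T > -t).
        return 1 - t_upper_tail(-t, df, steps)
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    # Half the total area lies above 0, so subtract the area from 0 to t.
    return 0.5 - area_0_to_t


print(df_one_sample(18))  # row of the t-table for question 1
print(df_paired(44))      # row of the t-table for question 2
print(round(t_upper_tail(2.35, 14), 4))  # should fall between 0.01 and 0.025
```

The two table entries on either side of 2.35 make a handy sanity check: t_upper_tail(2.14479, 14) and t_upper_tail(2.62449, 14) should come out very close to 0.025 and 0.01.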
Article / Updated 10-21-2021
You can use the Z-score table to find a full set of "less-than" probabilities for a wide range of z-values using the z-score formula. Below you will find both the positive z-score and negative z-score table. In figuring out statistics problems, make sure you understand how to use the Z-table to find the probabilities you want.

Z Score Table Sample Problems

Use these sample z-score math problems to help you learn the z-score formula.

1. What is P (Z ≤ 1.5) ?

Answer: 0.9332

To find the answer using the Z-table, find where the row for 1.5 intersects with the column for 0.00; this value is 0.9332. The Z-table shows only "less than" probabilities so it gives you exactly what you need for this question. Note: No probability is exactly at one single point, so:

P (Z ≤ 1.5) = P (Z < 1.5)

2. What is P (Z ≥ 1.5) ?

Answer: 0.0668

Use the Z-table to find where the row for 1.5 intersects with the column for 0.00, which is 0.9332. Because the Z-table gives you only "less than" probabilities, subtract P(Z < 1.5) from 1 (remember that the total probability for the normal distribution is 1.00, or 100%):

P (Z ≥ 1.5) = 1 – P (Z < 1.5) = 1 – 0.9332 = 0.0668

3. What is P (–0.5 ≤ Z ≤ 1.0) ?

Answer: 0.5328

To find the probability that Z is between two values, use the Z-table to find the probabilities corresponding to each z-value, and then find the difference between the probabilities. Here, you want the probability that Z is between –0.5 and 1.0. First, use the Z-table to find the value where the row for –0.5 intersects with the column for 0.00, which is 0.3085. Then, find the value where the row for 1.0 intersects with the column for 0.00, which is 0.8413. Because the Z-table gives you only "less than" probabilities, find the difference between the probability less than 1.0 and the probability less than –0.5:

P (–0.5 ≤ Z ≤ 1.0) = P (Z ≤ 1.0) – P (Z ≤ –0.5) = 0.8413 – 0.3085 = 0.5328

4. What is P (–1.0 ≤ Z ≤ 1.0) ?

Answer: 0.6826

To find the probability that Z is between two values, use the Z-table to find the probabilities corresponding to each z-value, and then find the difference between the probabilities. Here, you want the probability that Z is between –1.0 and 1.0. First, use the Z-table to find the value where the row for –1.0 intersects with the column for 0.00, which is 0.1587. Then, find the value where the row for 1.0 intersects with the column for 0.00, which is 0.8413. Because the Z-table gives you only "less than" probabilities, find the difference between the probability less than 1.0 and the probability less than –1.0:

P (–1.0 ≤ Z ≤ 1.0) = P (Z ≤ 1.0) – P (Z ≤ –1.0) = 0.8413 – 0.1587 = 0.6826
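The three kinds of lookups in these sample problems ("less than," "greater than," and "between") can be sketched in Python as a stand-in for the Z-table. The helper names below are made up for this illustration; the only library call is NormalDist().cdf from the standard library, which returns the "less than" probability for a z-value.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1


def less_than(z):
    """P(Z <= z): the value a Z-table gives you directly."""
    return Z.cdf(z)


def greater_than(z):
    """P(Z >= z): 1 minus the "less than" probability."""
    return 1 - Z.cdf(z)


def between(a, b):
    """P(a <= Z <= b): difference of two "less than" probabilities."""
    return Z.cdf(b) - Z.cdf(a)


print(f"P(Z <= 1.5)         = {less_than(1.5):.4f}")
print(f"P(Z >= 1.5)         = {greater_than(1.5):.4f}")
print(f"P(-0.5 <= Z <= 1.0) = {between(-0.5, 1.0):.4f}")
print(f"P(-1.0 <= Z <= 1.0) = {between(-1.0, 1.0):.4f}")
```

Because the code subtracts unrounded probabilities while the Z-table subtracts values already rounded to four decimals, the last digit can differ by one (for example, 0.6827 rather than 0.6826 for the final problem).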