Published: June 7, 2016

Statistics For Dummies

Overview

The fun and easy way to get down to business with statistics

Stymied by statistics? No fear! This friendly guide offers clear, practical explanations of statistical ideas, techniques, formulas, and calculations, with lots of examples that show you how these concepts apply to your everyday life.

Statistics For Dummies shows you how to interpret and critique graphs and charts, determine the odds with probability, guesstimate with confidence using confidence intervals, set up and carry out a hypothesis test, compute statistical formulas, and more.

  • Tracks to a typical first semester statistics course
  • Updated examples resonate with today's students
  • Explanations mirror teaching methods and classroom protocol

Packed with practical advice and real-world problems, Statistics For Dummies gives you everything you need to analyze and interpret data for improved classroom or on-the-job performance.


About The Author

Deborah J. Rumsey, PhD, is Professor of Statistics and Statistics Education Specialist at The Ohio State University. She is the author of Statistics Workbook For Dummies, Statistics II For Dummies, and Probability For Dummies.

Sample Chapters


Cheat Sheet

Whether you’re studying for an exam or just want to make sense of the data around you every day, knowing how and when to use the data analysis techniques and formulas of statistics will help. Being able to make the connections between those statistical techniques and formulas is perhaps even more important: it builds confidence when tackling statistical problems and solidifies your strategies for completing statistical projects.


Articles From The Book

The Empirical Rule (68-95-99.7) says that if the population of a statistical data set has a normal distribution (where the data are in the shape of a bell curve) with population mean µ and standard deviation σ, then the following conditions are true: About 68% of the values lie within 1 standard deviation of the mean (or between the mean minus 1 times the standard deviation, and the mean plus 1 times the standard deviation).
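The rule can be verified numerically (a minimal sketch using Python's standard-library NormalDist; the percentages come from the normal distribution itself, not from any data set):

```python
from statistics import NormalDist

# For any normal distribution, the fraction of values within k standard
# deviations of the mean equals cdf(k) - cdf(-k) on the standard normal.
Z = NormalDist()  # mean 0, standard deviation 1

within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

print(round(within[1], 4))  # 0.6827 -> "about 68%"
print(round(within[2], 4))  # 0.9545 -> "about 95%"
print(round(within[3], 4))  # 0.9973 -> "about 99.7%"
```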
How do you select a statistical sample in a way that avoids bias? The key word is random. A random sample is a sample selected by equal opportunity; that is, every possible sample of the same size as yours had an equal chance to be selected from the population. What random really means is that no subset of the population is favored in or excluded from the selection process.
Statistical formulas don’t know whether they are being used properly, and they don’t warn you when your results are incorrect. In order to draw appropriate conclusions, it’s up to you to avoid overstating your results and to verify any cause-and-effect relationships they suggest. Some of the most common mistakes made in conclusions are overstating the results or generalizing them to a larger group than the study actually represented.
Critical values (z*-values) are an important component of confidence intervals (the statistical technique for estimating population parameters). The z*-value, which appears in the margin of error formula, measures the number of standard errors to be added and subtracted in order to achieve your desired confidence level (the percentage confidence you want).
In statistics, every confidence interval (and every margin of error, for that matter) has a percentage associated with it, called a confidence level. This percentage represents how confident you are that the results will capture the true population parameter, depending on the luck of the draw with your random sample.
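The link between a confidence level and its critical value can be sketched in a few lines (standard-library NormalDist; the helper name z_star is my own):

```python
from statistics import NormalDist

def z_star(confidence):
    """Critical value: the z with (1 - confidence)/2 probability in each tail."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

print(round(z_star(0.95), 2))   # 1.96
print(round(z_star(0.90), 3))   # 1.645
print(round(z_star(0.99), 3))   # 2.576
```

Raising the confidence level widens the tails you must cover, so z* grows.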
There are two major types of statistical studies: surveys and experiments. After a question has been formed, researchers must design an effective study to collect data that will help answer it. This means they must decide whether to use a survey or an experiment to get the data they need. A statistical survey is an observational study: one in which data is collected on individuals in a way that doesn’t affect them.
If you know the standard deviations for two population samples, then you can find a confidence interval (CI) for the difference between their means, or averages. The goal of many statistical surveys and studies is to compare two populations, such as men versus women, low- versus high-income families, and Republicans versus Democrats.
After collecting good statistical data, you can summarize it with descriptive statistics. These are numbers that describe a data set in terms of its important features: If the data are categorical (where individuals are placed into groups, such as gender or political affiliation), they are typically summarized using the number of individuals in each group (called the frequency) or the percentage of individuals in each group (called the relative frequency).
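Frequencies and relative frequencies for categorical data can be tallied in a few lines (a sketch with made-up survey responses):

```python
from collections import Counter

# Hypothetical categorical responses to a survey question
responses = ["Agree", "Disagree", "Agree", "No opinion", "Agree", "Disagree"]

freq = Counter(responses)                                    # frequency per group
rel_freq = {g: n / len(responses) for g, n in freq.items()}  # relative frequency

print(freq["Agree"])                # 3
print(round(rel_freq["Agree"], 2))  # 0.5
```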
You can summarize your statistical data in a visual way using charts and graphs. These are displays that are organized to give you a big picture of the data in a flash and to zoom in on a particular result that was found. In this world of quick information and sound bites, graphs and charts are commonplace. Here are some popular types of graphs and charts: Some of the basic graphs used for categorical data include pie charts and bar graphs, which break down how many responses were given for each group of certain variables, such as gender or which applications are used on teens’ cellphones.
Standard deviation tells you how the values are spread out in a statistical sample. For example, have you heard anyone report that a certain result was found to be “two standard deviations above the mean”? More and more, people want to report how significant their results are, and the number of standard deviations above or below average is one way to do it.
You use hypothesis tests to challenge whether some claim about a population is true (for example, a claim that 40 percent of Americans own a cellphone). To test a statistical hypothesis, you take a sample, collect data, form a statistic, standardize it to form a test statistic (so it can be interpreted on a standard scale), and decide whether the test statistic refutes the claim.
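Those steps can be sketched as a one-sample test of a proportion (the numbers are hypothetical: testing the 40-percent-cellphone claim against a sample of 500 people, 230 of whom own one; the normal approximation is assumed to apply):

```python
from math import sqrt
from statistics import NormalDist

# H0: p = 0.40 versus Ha: p != 0.40 (hypothetical sample)
p0, n, successes = 0.40, 500, 230
p_hat = successes / n

# Standardize the sample proportion into a test statistic.
se = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se

# Two-sided p-value: probability of a result at least this extreme under H0.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(z, 2))        # 2.74
print(round(p_value, 3))  # small p-value -> evidence against the claim
```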
If a statistical data set has a normal distribution, it is customary to standardize all the data to obtain standard scores known as z-values or z-scores. The distribution of z-values takes on a standard normal distribution (or Z-distribution). The standard normal distribution is a special normal distribution with a mean equal to 0 and a standard deviation equal to 1.
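Standardizing is a one-line calculation (a sketch; the exam mean and standard deviation are made up):

```python
def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations it sits from the mean."""
    return (x - mu) / sigma

# Hypothetical exam with mean 70 and standard deviation 5
print(z_score(80, 70, 5))  # 2.0 -> two standard deviations above the mean
print(z_score(65, 70, 5))  # -1.0 -> one standard deviation below
```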
One of the most common goals of statistical research is to find links between variables. Using correlation, regression, and two-way tables, you can use data to answer questions like these: Which lifestyle behaviors increase or decrease the risk of cancer? How many side effects are associated with this new drug?
A statistical graph can give you a false picture of the statistics on which it is based. For example, it can be misleading through its choice of scale on the frequency/relative frequency axis (that is, the axis where the amounts in each group are reported), and/or its starting value. By using a "stretched out" scale (for example, having each half inch of a bar represent 10 units versus 50 units), you can stretch the truth, make differences look more dramatic, or exaggerate values.
There are no hard and fast rules for how to create a histogram based on a set of statistical data; the person making the graph gets to choose the groupings on the x-axis as well as the scale and starting and ending points on the y-axis. Just because there is an element of choice, however, doesn't mean every choice is appropriate; in fact, a histogram can be made to be misleading in many ways.
A histogram is a special graph applied to statistical data broken down into numerically ordered groups; for example, age groups such as 10–20, 21–30, 31–40, and so on. A histogram provides a snapshot of all the data, making it a quick way to get the big picture of the data, in particular, its general shape. In a histogram, the bars connect to each other — as opposed to a bar graph for categorical data, where the bars represent categories that don't have a particular order, and are separated.
One main staple of research studies is called hypothesis testing. A hypothesis test is a technique for using data to validate or invalidate a claim about a population. For example, a politician may claim that 80% of the people in her state agree with her — is that really true? Or, a company may claim that they deliver pizzas in 30 minutes or less; is that really true?
In statistics, the standard deviation of a population affects the standard error for that population. Standard deviation measures the amount of variation in a population. In the standard error formula, σ/√n, the population standard deviation σ is in the numerator. That means as the population standard deviation increases, the standard error of the sample means also increases.
The size (n) of a statistical sample affects the standard error for that sample. Because n is in the denominator of the standard error formula, the standard error decreases as n increases. It makes sense that having more data gives less variation (and more precision) in your results. (Figure: distributions of times for 1 worker, 10 workers, and 50 workers.)
In statistics, there are two important ideas regarding sample size and margin of error. First, sample size and margin of error have an inverse relationship. Second, after a point, increasing the sample size beyond what you already have gives you a diminished return, because the increased accuracy will be negligible.
Of all of the misunderstood statistical issues, the one that’s perhaps the most problematic is the misuse of the concepts of correlation and causation. Correlation, as a statistical term, is the extent to which two numerical variables have a linear relationship (that is, a relationship that increases or decreases at a constant rate).
Statistical studies often involve several kinds of experiments: treatment groups, control groups, placebos, and blind and double-blind tests. An experiment is a study that imposes a treatment (or control) to the subjects (participants), controls their environment (for example, restricting their diets, giving them certain dosage levels of a drug or placebo, or asking them to stay awake for a prescribed period of time), and records the responses.
In statistics, when the original distribution for a population X is normal, then you can also assume that the shape of the sampling distribution of the sample mean, X̄, will also be normal, regardless of the sample size n. For example, if you look at the amount of time (X) required for a clerical worker to complete a task, you may find that X has a normal distribution (refer to the lowest curve in the figure).
A pie chart takes categorical data from a statistical sample and breaks them down by group, showing the percentage of individuals that fall into each group. Because a pie chart takes on the shape of a circle, the "slices" that represent each group can easily be compared and contrasted. Because each individual in the study falls into one and only one category, the sum of all the slices of the pie should be 100% or close to it (subject to a bit of rounding off).
In statistics, if a population X has any distribution that is not normal, or if its distribution is unknown, you can’t automatically say the distribution of the sample means has a normal distribution. But incredibly, you can use a normal distribution to approximate the distribution of X̄ if the sample size is large enough.
When you create a statistical time chart, you have to evaluate the units of the numbers being plotted. For example, it's misleading to chart the number of crimes over time, rather than the crime rate (crimes per capita) — because the population size of a city changes over time, crime rate is the appropriate measure.
The normal distribution is used to help measure the accuracy of many statistics, including the sample mean, using an important result called the Central Limit Theorem. This theorem gives you the ability to measure how much the means of various samples will vary, without having to take any other sample means to compare it with.
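A small simulation illustrates the theorem (a sketch: the population here is a skewed exponential distribution with mean 1, yet the means of many samples cluster tightly around 1, with spread close to σ/√n):

```python
import random
from statistics import mean, stdev

random.seed(1)  # make the simulation repeatable

# Population: an exponential distribution with mean 1 -- clearly not normal.
def draw_sample_mean(n):
    return mean(random.expovariate(1.0) for _ in range(n))

# Means of 2000 samples of size 40. The CLT says they pile up near the
# population mean (1.0) with standard error 1/sqrt(40), about 0.158.
sample_means = [draw_sample_mean(40) for _ in range(2000)]

print(round(mean(sample_means), 2))   # close to 1.0
print(round(stdev(sample_means), 2))  # close to 0.16
```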
You can connect the shape of a histogram with the mean and median of the statistical data that you use to create it. Conversely, the relationship between the mean and median can help you predict the shape of the histogram. The preceding graph is a histogram showing the ages of winners of the Best Actress Academy Award; you can see it is skewed right.
You can break categorical data down using two-way tables (also known as contingency tables, cross-tabulations or crosstabs) to summarize statistical information about different groups. Categorical data (also known as qualitative data) capture qualities or characteristics about an individual, such as a person’s eye color, gender, political party, or opinion on some issue (typically using categories such as Agree, Disagree, or No opinion, or some variation of these).
If all you are interested in is where you stand compared to the rest of the herd, you need a statistic that reports relative standing, and that statistic is called a percentile. The kth percentile is a value in a data set that splits the data into two pieces: The lower piece contains k percent of the data, and the upper piece contains the rest of the data (which amounts to [100 – k] percent, because the total amount of data is 100 percent).
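One common convention for computing the kth percentile can be sketched as follows (textbooks and software packages differ slightly in their rules, so this is one convention among several; the function name is my own):

```python
from math import ceil

def kth_percentile(data, k):
    """One common convention: the value at position ceil(k/100 * n)
    in the sorted data. Other texts/software interpolate instead."""
    s = sorted(data)
    return s[max(0, ceil(k / 100 * len(s)) - 1)]

# Hypothetical test scores
scores = [55, 60, 62, 68, 70, 74, 77, 81, 85, 93]

print(kth_percentile(scores, 50))  # 70 -> half the scores are at or below 70
print(kth_percentile(scores, 90))  # 85
```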
By far the most common measure of variation for numerical data in statistics is the standard deviation. The standard deviation measures how concentrated the data are around the mean; the more concentrated, the smaller the standard deviation. It’s not reported nearly as often as it should be, but when it is, you often see it reported in parentheses alongside the mean.
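Computing a sample standard deviation takes one call to Python's statistics module (a sketch with made-up data):

```python
from statistics import mean, stdev

# Hypothetical small data set
data = [4, 5, 5, 6, 6, 6, 7, 9]

s = stdev(data)  # sample standard deviation: typical distance from the mean

print(round(mean(data), 1))  # 6.0
print(round(s, 2))           # 1.51
```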
If you know the standard deviation for a population, then you can calculate a confidence interval (CI) for the mean, or average, of that population. When a statistical characteristic that’s being measured (such as income, IQ, price, height, quantity, or weight) is numerical, most people want to estimate the mean (average) value for the population.
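Such a confidence interval can be sketched in a few lines (hypothetical numbers: a sample of n = 100 incomes with sample mean $50,000 and a population standard deviation assumed known to be $8,000):

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical inputs
n, xbar, sigma, conf = 100, 50_000, 8_000, 0.95

# Critical value for the desired confidence level (1.96 for 95%)
z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)

# CI for the mean: sample mean plus/minus the margin of error
moe = z_star * sigma / sqrt(n)

print(round(moe))                            # about 1568
print(round(xbar - moe), round(xbar + moe))  # the 95% CI endpoints
```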
You can calculate a confidence interval (CI) for the mean, or average, of a population even if the standard deviation is unknown or the sample size is small. When a statistical characteristic that’s being measured (such as income, IQ, price, height, quantity, or weight) is numerical, most people want to estimate the mean (average) value for the population.
Can one statistic measure both the strength and direction of a linear relationship between two variables? Sure! Statisticians use the correlation coefficient to measure the strength and direction of the linear relationship between two numerical variables X and Y. The correlation coefficient for a sample of data is denoted by r.
In statistics, you can calculate a regression line for two variables if their scatterplot shows a linear pattern and the correlation between the variables is very strong (for example, r = 0.98). A regression line is simply a single line that best fits the data (in terms of having the smallest overall distance from the line to the points).
When you need to find a statistical sample mean (or average), you also need to report a margin of error, or MOE, for the sample mean. You can also calculate the margin of error of a sample proportion, which is the proportion of "successes" in a sample compared to the whole. The general formula for the margin of error for the sample mean (assuming a certain condition is met) is z*σ/√n, where σ is the population standard deviation, n is the sample size, and z* is the appropriate z*-value for your desired level of confidence.
When you report the results of a statistical survey, you need to include the margin of error. The general formula for the margin of error for a sample proportion (if certain conditions are met) is z*√(p̂(1 − p̂)/n), where p̂ is the sample proportion, n is the sample size, and z* is the appropriate z*-value for your desired level of confidence.
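That formula can be sketched directly (the function name moe_proportion and the poll numbers are my own):

```python
from math import sqrt
from statistics import NormalDist

def moe_proportion(p_hat, n, confidence=0.95):
    """Margin of error for a sample proportion: z* * sqrt(p_hat(1-p_hat)/n)."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z_star * sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical poll: 520 of 1,000 respondents say yes -> p_hat = 0.52
print(round(moe_proportion(0.52, 1000), 3))  # about 0.031, i.e. plus/minus 3 points
```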
The most common way to summarize a statistical data set is to describe where the center, or mean, is. One way of thinking about what the mean of a data set means is to ask, “What’s a typical value?” Or, “Where is the middle of the data?” The center of a data set can actually be measured in different ways, and the method chosen can greatly influence the conclusions people make about the data.
If you have a statistical sample with a normal distribution, you can plug an x-value for this distribution into a special equation to find its z-value. The z-value can then help you to interpret statistical values such as finding out whether a student's relative standing is better in one class than another. Changing an x-value to a z-value is called standardizing.
The most complex part of interpreting a statistical histogram is to get a handle on what you want to show on the x and y axes. Having good descriptive labels on the axes will help. Most statistical software packages label the x-axis using the variable name you provided when you entered your data (for example, "age" or "weight").
You can compare numerical data for two statistical populations or groups (such as cholesterol levels in men versus women, or income levels for high school versus college grads) to test a claim about the difference in their averages. (For example, is the difference in the population means equal to zero, indicating their means are equal?)
For statistical purposes, you can compare two populations or groups when the variable is categorical (smoker/nonsmoker, Democrat/Republican, support/oppose an opinion, and so on) and you’re interested in the proportion of individuals with a certain characteristic, such as the proportion of smokers. In order to make this comparison, two independent (separate) random samples need to be selected, one from each population.
You can find a confidence interval (CI) for the difference between the means, or averages, of two population samples, even if the population standard deviations are unknown and/or the sample sizes are small. The goal of many statistical surveys and studies is to compare two populations, such as men versus women, low versus high income families, and Republicans versus Democrats.
In statistics, a random variable is a characteristic, measurement, or count that changes randomly according to a certain set or pattern. Random variables are usually denoted with capital letters such as X, Y, Z, and so on. In math you have variables like X and Y that take on certain values depending on the problem (for example, the width of a rectangle), but in statistics the variables change in a random way.
You can find the confidence interval (CI) for a population proportion to show the statistical probability that a characteristic is likely to occur within the population. When a characteristic being measured is categorical — for example, opinion on an issue (support, oppose, or are neutral), gender, political party, or type of behavior (do/don’t wear a seatbelt while driving) — most people want to estimate the proportion (or percentage) of people in the population that fall into a certain category of interest.
The margin of error of a confidence interval (CI) is affected by the size of the statistical sample; as the size increases, the margin of error decreases. Looking at this the other way around, if you want a smaller margin of error (and doesn’t everyone?), you need a larger sample size. Suppose you are getting ready to do your own survey to estimate a population mean; wouldn’t it be nice to see ahead of time what sample size you need to get the margin of error you want?
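Solving the margin-of-error formula for n gives the required sample size (a sketch assuming a known population standard deviation; the numbers and the helper name needed_n are hypothetical):

```python
from math import ceil
from statistics import NormalDist

def needed_n(sigma, moe, confidence=0.95):
    """Smallest n with z* * sigma / sqrt(n) <= moe, assuming sigma is known."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z_star * sigma / moe) ** 2)

# Hypothetical: sigma = 15, estimate the mean to within +/- 2 at 95% confidence
print(needed_n(15, 2))  # 217
# Halving the margin of error roughly quadruples the required sample size:
print(needed_n(15, 1))  # 865
```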
To estimate the difference between two population proportions with a confidence interval, you can use the Central Limit Theorem when the sample sizes are large enough (typically, each at least 30). When a statistical characteristic, such as opinion on an issue (support/don’t support), of the two groups being compared is categorical, people want to report on the differences between the two population proportions — for example, the difference between the proportion of women and men who support a four-day work week.
After you identify that a random variable X has a binomial distribution, you'll likely want to find probabilities for X. The good news is that you don't have to find them from scratch; you get to use established statistical formulas for finding binomial probabilities, using the values of n and p unique to each problem.
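The binomial formula P(X = k) = C(n, k) p^k (1 − p)^(n−k) can be sketched with Python's math.comb (the example values n = 10, p = 0.3 are my own):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for a binomial: C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical: 10 trials, 0.3 chance of success on each
print(round(binom_pmf(3, 10, 0.3), 4))  # 0.2668

# Sanity check: the probabilities over all outcomes sum to 1
print(round(sum(binom_pmf(k, 10, 0.3) for k in range(11)), 4))  # 1.0
```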
Hypothesis tests are used to test the validity of a claim that is made about a population. This claim that’s on trial, in essence, is called the null hypothesis (H0). The alternative hypothesis (Ha) is the one you would believe if the null hypothesis is concluded to be untrue. Learning how to find the p-value in statistics is a fundamental skill in testing, helping you weigh the evidence against the null hypothesis.
When you want to find percentiles for a t-distribution, you can use the t-table. A percentile is a number on a statistical distribution whose less-than probability is the given percentage; for example, the 95th percentile of the t-distribution with n – 1 degrees of freedom is that value of t whose left-tail (less-than) probability is 0.95.
You can use the z-table to find a full set of "less-than" probabilities for a wide range of z-values. To use the z-table to find probabilities for a statistical sample with a standard normal (Z-) distribution, start by going to the row that represents the ones digit and the first digit after the decimal point (the tenths digit) of your z-value.
In statistics, you can easily find probabilities for a sample mean if it has a normal distribution. Even if it doesn’t have a normal distribution, or the distribution is not known, you can find probabilities if the sample size, n, is large enough. The normal distribution is a very friendly distribution that has a table for finding probabilities and anything else you need.
You can find probabilities for a sample proportion by using the normal approximation as long as certain conditions are met. For example, say that a statistical study claims that 0.38 or 38% of all the students taking the ACT test would like math help. Suppose you take a random sample of 100 students. What is the probability that more than 45 of them say they need math help?
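That example can be worked through numerically (a sketch using the normal approximation, ignoring the continuity correction):

```python
from math import sqrt
from statistics import NormalDist

# From the example: p = 0.38, n = 100; what is P(p_hat > 0.45)?
p, n = 0.38, 100

se = sqrt(p * (1 - p) / n)  # standard error of the sample proportion
z = (0.45 - p) / se         # standardize 0.45

prob = 1 - NormalDist().cdf(z)  # right-tail probability

print(round(z, 2))      # 1.44
print(round(prob, 3))   # roughly 0.075
```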
You can use a t-table to find right-tail probabilities and p-values for hypothesis tests and to find t*-values (critical values) for a confidence interval involving t. Unlike a Z-distribution, a t-distribution is not classified by its mean and standard deviation, but by the sample size of the data set being used (n).
If your statistical sample has a normal distribution (X), then you can use the Z-table to find the probability that something will occur within a defined set of parameters. For example, you could look at the distribution of fish lengths in a pond to determine how likely you are to catch a certain length of fish.
A popular normal distribution problem involves finding percentiles for X. That is, you are given the percentage or statistical probability of being at or below a certain x-value, and you have to find the x-value that corresponds to it. For example, if you know that the people whose golf scores were in the lowest 10% got to go to a tournament, you may wonder what the cutoff score was; that score would represent the 10th percentile.
Confidence intervals estimate population parameters, such as the population mean, by using statistics (for example, the sample mean) plus or minus a margin of error (MOE). To compute the margin of error for a confidence interval, you need a critical value (the number of standard errors you add and subtract to get the margin of error you want).
In statistics, if you want to draw conclusions about a null hypothesis H0 (reject or fail to reject) based on a p-value, you need to set a predetermined cutoff point where only those p-values less than or equal to the cutoff will result in rejecting H0. While 0.05 is a very popular cutoff value for rejecting H0, cutoff points and resulting decisions can vary — some people use stricter cutoffs, such as 0.01.
To obtain a measure of variation based on the five-number summary of a statistical sample, you can find what's called the interquartile range, or IQR. The purpose of the five-number summary is to give descriptive statistics for center, variation, and relative standing all in one shot. The measure of center in the five-number summary is the median, and the first quartile, median, and third quartiles are measures of relative standing.
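The quartiles and IQR can be computed with Python's statistics.quantiles (a sketch with made-up data; note that textbooks use slightly different quartile conventions, and Python's default "exclusive" method is one of them):

```python
from statistics import quantiles

# Hypothetical data set
data = [1, 3, 4, 5, 5, 6, 7, 8, 9, 12]

# quantiles(..., n=4) returns [Q1, median, Q3] under the default
# "exclusive" convention; other conventions give slightly different cuts.
q1, med, q3 = quantiles(data, n=4)
iqr = q3 - q1  # spread of the middle 50% of the data

print(q1, med, q3)  # 3.75 5.5 8.25
print(iqr)          # 4.5
```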
Because the binomial distribution is so commonly used, statisticians went ahead and did all the grunt work to figure out nice, easy formulas for finding its mean, variance, and standard deviation. The following results are what came out of it. If X has a binomial distribution with n trials and probability of success p on each trial, then the mean of X is np, the variance of X is np(1 − p), and the standard deviation of X is √(np(1 − p)). For example, suppose you flip a fair coin 100 times and let X be the number of heads; then X has a binomial distribution with n = 100 and p = 0.5.
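Those formulas, applied to the coin-flip example:

```python
from math import sqrt

# Binomial: n trials, probability of success p (fair-coin example)
n, p = 100, 0.5

mean = n * p                   # np
variance = n * p * (1 - p)     # np(1 - p)
sd = sqrt(variance)            # sqrt(np(1 - p))

print(mean)      # 50.0 -> expect about 50 heads
print(variance)  # 25.0
print(sd)        # 5.0
```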
The median is a statistic that is commonly used to measure the center of a data set. However, it is still an unsung hero of statistics in the sense that it isn’t used nearly as often as it should be, although people are beginning to report it more nowadays. The median of a data set is the value that lies exactly in the middle when the data have been ordered from the lowest value to the greatest value.
If you are working from a large statistical sample, then solving problems using the binomial distribution might seem daunting. However, there's actually a very easy way to approximate the binomial distribution, as shown in this article. Here's an example: suppose you flip a fair coin 100 times and you let X equal the number of heads.
If you use a large enough statistical sample size, you can apply the Central Limit Theorem (CLT) to a sample proportion for categorical data to find its sampling distribution. The population proportion, p, is the proportion of individuals in the population who have a certain characteristic of interest (for example, the proportion of all Americans who are registered voters, or the proportion of all teenagers who own cellphones).
If your data create a histogram that is not bell-shaped, you can use a set of statistics that is based on percentiles to describe the big picture of the data. Called the five-number summary, this method involves cutting the data into four equal pieces and reporting the resulting cutoff points that separate these pieces.
When you create a histogram, it's important to group the data sets into ranges that let you see meaningful patterns in your statistical data. For example, say you want to see if actresses who have won an Academy Award were likely to be within a certain age range. The following image shows a histogram of Best Actress Academy Award winners' ages between 1928 and 2009.
Sometimes the mean versus median debate can get quite interesting (even beyond mathematician standards), especially when you look at the skewness and symmetry of your statistical data in a histogram. For example, suppose you’re part of an NBA team trying to negotiate salaries. If you represent the owners, you want to show how much everyone is making and how much money you’re spending, so you want to take into account those superstar players and report the average.
Bias is a word you hear all the time in statistics, and you probably know that it means something bad. But what really constitutes bias? Bias is systematic favoritism that is present in the data collection process, resulting in lopsided, misleading results. Bias can occur in any of a number of ways: In the way the sample is selected.
The most well-known and loved discrete random variable in statistics is the binomial. Binomial means two names and is associated with situations involving two outcomes; for example yes/no, or success/failure (hitting a red light or not, developing a side effect or not). A binomial variable has a binomial distribution.
In statistics, a sampling distribution is based on sample averages rather than individual outcomes. This makes it different from the distribution of a single random variable. Here’s why: A random variable is a characteristic of interest that takes on certain values in a random manner. For example, the number of red lights you hit on the way to work or school is a random variable; the number of children a randomly selected family has is a random variable.
Two of the most important terms in statistics are mean and variance, and so you need to be able to identify their notations when working with discrete random variables. The mean of a random variable is the average of all the outcomes you would expect in the long term (over all possible samples). For example, if you roll a die a billion times and record the outcomes, the average of those outcomes is expected to be 3.5.
A discrete random variable X can take on a certain set of possible outcomes, and each of those outcomes has a certain statistical probability of occurring. The notation used for any specific outcome is a lowercase x, and the probability of any specific outcome occurring is denoted p(x), which you pronounce "p of x" or "probability of x."
Standard deviation can be difficult to interpret as a single number on its own. Basically, a small standard deviation means that the values in a statistical data set are close to the mean (or average) of the data set, and a large standard deviation means that the values in the data set are farther away from the mean.
In statistics, once you have calculated the slope and y-intercept to form the best-fitting regression line in a scatterplot, you can then interpret their values. The slope of a regression line is interpreted in algebra as rise over run. If, for example, the slope is 2, you can write this as 2/1 and say that as you move along the line, as the value of the X variable increases by 1, the value of the Y variable increases by 2.
Scatterplots are useful for interpreting trends in statistical data. Each observation (or point) in a scatterplot has two coordinates. The first corresponds to the first piece of data in the pair (that’s the X coordinate; the amount that you go left or right). The second coordinate corresponds to the second piece of data in the pair (that’s the Y-coordinate; the amount that you go up or down).
A bar graph (or bar chart) is perhaps the most common statistical data display used by the media. A bar graph breaks categorical data down by group, and represents these amounts by using bars of different lengths. It uses either the number of individuals in each group (also called the frequency) or the percentage in each group (called the relative frequency).
You’ve probably heard or seen results like this: “This statistical survey had a margin of error of plus or minus 3 percentage points.” What does this mean? Most surveys are based on information collected from a sample of individuals, not the entire population (as a census would be). A certain amount of error is bound to occur — not in the sense of calculation error (although there may be some of that, too) but in the sense of sampling error, which is the error that occurs simply because the researchers aren’t asking everyone.
One of the features that a histogram can show you is the shape of the statistical data — in other words, the manner in which the data fall into groups. For example, all the data may be exactly the same, in which case the histogram is just one tall bar; or the data might have an equal number in each group, in which case the shape is flat.
A boxplot is a one-dimensional graph of numerical data based on the five-number summary. This summary includes the following statistics: the minimum value, the 25th percentile (known as Q1), the median, the 75th percentile (Q3), and the maximum value. In essence, these five descriptive statistics divide the data set into four parts, where each part contains 25% of the data.
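The five-number summary behind a boxplot can be computed by hand (or in a few lines of code). Note that textbooks differ slightly on the quartile convention; this sketch uses the medians of the lower and upper halves of the sorted data, with made-up values:

```python
from statistics import median

data = sorted([3, 5, 7, 8, 9, 11, 13, 15])   # hypothetical data set
n = len(data)

q1 = median(data[: n // 2])          # median of the lower half
q3 = median(data[(n + 1) // 2:])     # median of the upper half
five_num = (data[0], q1, median(data), q3, data[-1])

print(five_num)  # (min, Q1, median, Q3, max) = (3, 6, 8.5, 12, 15)
```

Each of the four sections between these five numbers contains 25% of the data, which is exactly what the box and whiskers display.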
When you create a histogram, you need to divide the data set into separate groups. However, some statistical data may be right on the borderline between two groups. What do you do in these situations? Take a look at the following table showing Best Actress Oscar Award winners between 1928 and 1935:

Ages of Best Actress Oscar Award Winners, 1928–1935

| Year | Winner | Age | Movie |
| --- | --- | --- | --- |
| 1928 | Janet Gaynor | 22 | Sunrise |
| 1929 | Mary Pickford | 37 | Coquette |
| 1930 | Norma Shearer | 30 | The Divorcee |
| 1931 | Marie Dressler | 62 | Min and Bill |
| 1932 | Helen Hayes | 32 | The Sin of Madelon Claudet |
| 1933 | Katharine Hepburn | 26 | Morning Glory |
| 1934 | Claudette Colbert | 31 | It Happened One Night |
| 1935 | Bette Davis | 27 | Dangerous |

Did you notice that one actress’s age lies right on a borderline?
When you set up a hypothesis test to determine the validity of a statistical claim, you need to define both a null hypothesis and an alternative hypothesis. Typically in a hypothesis test, the claim being made is about a population parameter (one number that characterizes the entire population). Because parameters tend to be unknown quantities, everyone wants to make claims about what their values may be.
After examining the design of the study and how data was collected, the next thing to do when you come upon a statistic or the result of a statistical study is to look for mathematical errors in the data. Start by asking, “Is this number correct?” Don’t assume it is! You’d probably be surprised at the number of simple arithmetic errors that occur when statistics are collected, summarized, reported, or interpreted.
You can get a sense of variability in a statistical data set by looking at its histogram. For example, if the data are all the same, they are all placed into a single bar, and there is no variability. If an equal amount of data is in each of several groups, the histogram looks flat with the bars close to the same height; this signals a fair amount of variability.
In order to know when a random variable in a statistical sample does not have a binomial distribution, you first have to know what makes it binomial. You can identify a random variable as being binomial if the following four conditions are met: There are a fixed number of trials (n). Each trial has two possible outcomes: success or failure.
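When those conditions do hold, the binomial probability of exactly k successes in n trials follows the familiar formula. A minimal sketch (the coin-flip numbers are just an illustration):

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for a binomial variable: n fixed trials, independent,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# probability of exactly 2 heads in 4 fair coin flips
print(binomial_pmf(2, 4, 0.5))  # 6 * 0.25 * 0.25 = 0.375
```

If any of the four conditions fails (say, the trials aren't independent, or p changes from trial to trial), this formula no longer applies.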
Although the normal (Z-) distribution and t-distribution are similar, they look different from each other and are used for different statistical purposes. The normal distribution is that well-known bell-shaped distribution whose mean is μ and whose standard deviation is σ. The standard normal (or Z-) distribution is the most common normal distribution, with a mean of 0 and standard deviation of 1.
You can use a hypothesis test to examine or challenge a statistical claim about a population mean if the variable is numerical (for example, age, income, time, and so on) and only one population or group (such as all U.S. households or all college students) is being studied. For example, a child psychologist claims that working mothers spend an average of 11 minutes per day talking to their children.
You can use a hypothesis test to test a statistical claim about a population proportion when the variable is categorical (for example, gender or support/oppose) and only one population or group is being studied (for example, all registered voters). The test looks at the proportion (p) of individuals in the population who have a certain characteristic — for example, the proportion of people who carry cellphones.
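As a hedged sketch of the test statistic for one population proportion (the numbers here are invented): suppose a poll of 1,000 people finds 53% support for a claim that the true proportion is 50%.

```python
from math import sqrt

def one_proportion_z(p_hat: float, p0: float, n: int) -> float:
    """Z test statistic for one population proportion:
    (sample proportion - claimed proportion) / standard error under H0."""
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

z = one_proportion_z(p_hat=0.53, p0=0.50, n=1000)
print(round(z, 2))  # about 1.90
```

A z of about 1.90 is close to, but just under, the 1.96 cutoff for significance at the 0.05 level in a two-sided test.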
You can test for an average difference using the paired t-test when the variable is numerical (for example, income, cholesterol level, or miles per gallon) and the individuals in the statistical sample are either paired up in some way according to relevant variables such as age or perhaps weight, or the same people are used twice (for example, using a pre-test and post-test).
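A minimal sketch of the paired t statistic, using invented pre-test/post-test blood-pressure readings for the same five people: take the per-person differences, then divide their mean by the standard error of the differences.

```python
from statistics import mean, stdev
from math import sqrt

pre  = [120, 135, 128, 140, 132]   # hypothetical pre-test values
post = [115, 130, 126, 133, 129]   # hypothetical post-test values (same people)

diffs = [a - b for a, b in zip(pre, post)]   # paired differences
n = len(diffs)

# t = mean difference / (sd of differences / sqrt(n)), with n - 1 df
t = mean(diffs) / (stdev(diffs) / sqrt(n))
print(round(t, 2))  # about 5.05
```

Because the pairing removes person-to-person variability, this test is often more powerful than comparing two independent groups.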
When using a test statistic for one population mean, there are two cases where you must use the t-distribution instead of the Z-distribution. The first case is where the sample size is small (below 30 or so), and the second case is when the population standard deviation, σ, is not known, and you have to estimate it using the sample standard deviation, s.
After a statistical study has been designed, be it a survey or an experiment, you need to select a sample of individuals who represent a cross-section of the entire population. This is critical to producing credible data in the end. Statisticians have a saying, “Garbage in equals garbage out.” If you select your subjects (the individuals who will participate in your study) in a way that is biased — that is, favoring certain individuals or groups of individuals — then your results will also be biased.
Statistical bias is the systematic favoritism of certain individuals or certain responses in a study. Bias is the nemesis of statisticians, and they do everything they can to avoid it. Want an example of bias? Say you’re conducting a phone survey on job satisfaction of Americans; if you call people at home during the day between 9 a.m. and 5 p.m., you systematically miss everyone who is away at work.
If a time chart includes too much statistical data, the result can be so complex that the data become impossible to interpret. Reducing the amount of data makes it easier for patterns to emerge. A chart of the time between eruptions for Old Faithful geyser in Yellowstone Park is shown in the following figure.
A statistical distribution is a listing of the possible values of a variable (or intervals of values), and how often (or at what density) they occur. It can take several forms, including binomial, normal, and t-distribution. A variable is a characteristic that's being counted, measured, or categorized. Examples include gender, age, height, weight, or number of pets you own.
When designing a study, the sample size is an important consideration because the larger the sample size, the more data you have, and the more precise your results will be (assuming high-quality data). If you know the level of precision you want (that is, your desired margin of error), you can calculate the sample size needed to achieve it.
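As a hedged sketch of that calculation for a survey proportion (the 3% target is just an example): the required sample size comes from inverting the margin-of-error formula, and using p = 0.5 gives the most conservative (largest) answer.

```python
from math import ceil

def sample_size_for_proportion(moe: float, z_star: float = 1.96,
                               p: float = 0.5) -> int:
    """Smallest n whose margin of error is at most `moe`.
    p = 0.5 maximizes p(1 - p), so it is the conservative default."""
    return ceil(z_star**2 * p * (1 - p) / moe**2)

# sample size needed for a +/- 3 percentage-point MOE at 95% confidence
print(sample_size_for_proportion(0.03))  # 1068
```

This is why national polls so often report samples of "about 1,000" with a 3-point margin of error.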
Several commonly used tables in statistics include the Z-table, the t-table, the binomial table, and a table of z*-values for selected confidence levels. Excerpted from Statistics For Dummies, 2nd Edition, by Deborah J. Rumsey, PhD (2011, Wiley). This material is reproduced with permission of John Wiley & Sons, Inc.
In statistics, numerical random variables represent counts and measurements. They come in two different flavors: discrete and continuous, depending on the type of outcomes that are possible: Discrete random variables. If the possible outcomes of a random variable can be listed out using a finite (or countably infinite) set of single numbers (for example, {0, 1, 2, . . .}), then the variable is discrete.
In statistics, a confidence interval is an educated guess about some characteristic of the population. A confidence interval contains an initial estimate plus or minus a margin of error (the amount by which you expect your results to vary, if a different sample were taken). The following table shows formulas for the components of the most common confidence intervals and keys for when to use them.
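As a minimal sketch of the most basic case (a confidence interval for a population mean with known standard deviation; all numbers here are invented), the interval is the estimate plus or minus z* times the standard error:

```python
from math import sqrt

x_bar = 7.5    # sample mean (hypothetical)
sigma = 2.0    # known population standard deviation (hypothetical)
n = 100        # sample size
z_star = 1.96  # critical value for 95% confidence

moe = z_star * sigma / sqrt(n)          # margin of error
ci = (x_bar - moe, x_bar + moe)         # confidence interval
print(ci)  # (7.108, 7.892)
```

Reading the result: you are 95% confident the population mean lies between about 7.11 and 7.89.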
When working with statistics, it’s important to recognize the different types of data: numerical (discrete and continuous), categorical, and ordinal. Data are the actual pieces of information that you collect through your study. For example, if you ask five of your friends how many pets they own, they might give you the following data: 0, 2, 1, 4, 18.
After data has been collected, the first step in analyzing it is to crunch out some descriptive statistics to get a feeling for the data. For example: Where is the center of the data located? How spread out is the data? How correlated are the data from two variables? The most common descriptive statistics are in the following table, along with their formulas and a short description of what each one measures.
Statistical researchers often use a linear relationship to predict the (average) numerical value of Y for a given value of X using a straight line (called the regression line). If you know the slope and the y-intercept of that regression line, then you can plug in a value for X and predict the average value for Y.
One very special member of the normal distribution family is called the standard normal distribution, or Z-distribution. In statistics, the Z-distribution is used to help find probabilities and percentiles for regular normal distributions (X). It serves as the standard by which all other normal distributions are measured.
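Any normal value X is converted to the Z-distribution by standardizing: subtract the mean and divide by the standard deviation. A quick sketch with made-up IQ-style numbers:

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Standardize x: number of standard deviations from the mean."""
    return (x - mu) / sigma

# e.g. a score of 110 on a scale with mean 100 and sd 15 (hypothetical)
z = z_score(110, 100, 15)
print(round(z, 3))  # about 0.667
```

Once a value is standardized, a single Z-table answers probability questions for every normal distribution.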
In statistics, the r value refers to the correlation coefficient, which is the statistical measure of the strength of a linear relationship between two variables. If that sounds complicated, don't worry — it really isn't, and I will explain it further down in this article. But before we get into r values, there's some background information you should understand first.
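As a hedged sketch of how r is computed (with a deliberately perfect made-up data set so the answer is easy to check), the correlation coefficient is the average product of the standardized x- and y-deviations:

```python
from statistics import mean, stdev

def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(x), mean(y)
    n = len(x)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) \
           / ((n - 1) * stdev(x) * stdev(y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]      # exactly 2 * x, so a perfect linear relationship
print(pearson_r(x, y))    # 1.0
```

r always falls between -1 and +1; here the perfect positive linear relationship gives exactly +1.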
Percentiles report the relative standing of a particular value within a statistical data set. If that’s what you’re most interested in, the actual mean and standard deviation of the data set are not important, and neither is the actual data value. What’s important is where you stand — not in relation to the mean, but in relation to everyone else: That’s what a percentile gives you.
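A minimal sketch of finding a percentile rank (conventions vary; this one counts the percent of values at or below the given value, using made-up exam scores):

```python
def percentile_rank(value: float, data: list[float]) -> float:
    """Percent of data values at or below `value` (one common convention)."""
    return 100 * sum(1 for v in data if v <= value) / len(data)

scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]  # hypothetical scores
print(percentile_rank(80, scores))  # 60.0
```

So a score of 80 here puts you at the 60th percentile: you did as well as or better than 60% of the group, regardless of what the mean or standard deviation happens to be.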
A boxplot can give you information regarding the shape, variability, and center (or median) of a statistical data set. Also known as a box and whisker chart, boxplots are particularly useful for displaying skewed data. Statistical data also can be displayed with other charts and graphs.

What the boxplot shape reveals about a statistical data set

A boxplot can show whether a data set is symmetric (roughly the same on each side when cut down the middle) or skewed (lopsided).
A time chart (also called a line graph) is a statistical display used to examine trends in data over time (also known as time series data). Time charts show time on the x-axis (for example, by month, year, or day) and the values of the variable being measured on the y-axis (like birth rates, total sales, or population size).
When you perform a hypothesis test in statistics, a p-value helps you determine the significance of your results. Hypothesis tests are used to test the validity of a claim that is made about a population. This claim that’s on trial, in essence, is called the null hypothesis. The alternative hypothesis is the one you would believe if the null hypothesis is concluded to be untrue.
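For a Z test statistic, the two-sided p-value can be computed from the standard normal CDF. A quick sketch using only the standard library (the normal CDF is expressed via the error function):

```python
from math import erf, sqrt

def two_sided_p_value(z: float) -> float:
    """Two-sided p-value for a Z test statistic under the standard normal."""
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))   # P(Z <= |z|)
    return 2 * (1 - cdf)                      # probability in both tails

p = two_sided_p_value(1.96)
print(round(p, 4))  # about 0.05
```

A small p-value (commonly below 0.05) is taken as evidence against the null hypothesis; z = 1.96 sits right at that conventional boundary.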
The distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of the data and how often they occur. When a distribution of categorical data is organized, you see the number or percentage of individuals in each group. When a distribution of numerical data is organized, they’re often ordered from smallest to largest, broken into reasonably sized groups (if appropriate), and then put into graphs and charts to examine the shape, center, and amount of variability in the data.
If you read statistical survey results without knowing the margin of error, or MOE, you are only getting part of the story. Survey results themselves (with no MOE) are only a measure of how the sample of selected individuals felt about the issue; they don’t reflect how the entire population may have felt, had they all been asked.
Statistical results should always include a margin of error and confidence intervals. This information is important because you often see statistics that try to estimate numbers pertaining to an entire population based on a survey of only a part of the population; in fact, you see these statistics almost every day in the form of survey results.
In statistics, the average and the median are two different representations of the center of a data set and can often give two very different stories about the data, especially when the data set contains outliers. The mean, also referred to by statisticians as the average, is the most common statistic used to measure the center of a numerical data set.
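To see how an outlier pulls the mean but not the median, consider this made-up list of household incomes (in thousands of dollars), where one value is far above the rest:

```python
from statistics import mean, median

incomes = [35, 38, 40, 42, 45, 300]   # hypothetical; 300 is an outlier

print(mean(incomes))    # about 83.3 -- dragged upward by the outlier
print(median(incomes))  # 41.0 -- barely affected
```

The median (41) describes a typical household far better here than the mean (about 83), which is why skewed data such as incomes and house prices are usually summarized with the median.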
The standard deviation is a commonly used statistic, but it doesn’t often get the attention it deserves. Although the mean and median are out there in common sight in the everyday media, you rarely see them accompanied by any measure of how diverse that data set was, and so you are getting only part of the story.
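As a minimal sketch of what the standard deviation measures (roughly, the typical distance of the data from the mean), using a small made-up data set and treating it as the whole population:

```python
from statistics import mean, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data set

print(mean(data))    # 5.0
print(pstdev(data))  # 2.0 -- population standard deviation
```

(For a sample rather than a full population, you would use `statistics.stdev`, which divides by n - 1 instead of n and gives a slightly larger value.)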
For virtually any statistical study of a population, you have to center your attention on a particular group of individuals (for example, a group of people, cities, animals, rock specimens, exam scores, and so on). For example: What do Americans think about the president’s foreign policy? What percentage of planted crops in Wisconsin did deer destroy last year?
A histogram gives you a rough idea of where the "center" of the data lies. The word center is in quotes because many different statistics are used to designate the center. The two most common measures of center are the average (the mean) and the median. To visualize the average age (the mean), picture the data as people sitting in various places on a teeter-totter (aka seesaw).
To explore the links between two categorical variables, you first need to organize the data that’s been collected, and a table is a great way to do that. A two-way table classifies individuals into groups based on the outcomes, or distributions, of two categorical variables (for example, gender and opinion). Suppose your local community developers are building a campground, and they’ve decided pets will be allowed as long as they’re on a leash.