9 Statistical and Graphical Tips and Traps
The world of statistics is full of pitfalls, but it’s also full of opportunities. Whether you’re a user of statistics or someone who has to interpret them, it’s possible to fall into the pitfalls. It’s also possible to walk around them. Here are ten tips and traps from the areas of hypothesis testing, regression, correlation, and graph.
Significant doesn’t always mean important
Significance is, in many ways, a poorly chosen term. When a statistical test yields a significant result, and the decision is to reject H0, that doesn’t guarantee that the study behind the data is an important one. Statistics can only help decision making about numbers and inferences about the processes that produced them. They can’t make those processes important or earth shattering. Importance is something you have to judge for yourself — and no statistical test can do that for you.
Regression isn’t always linear
When trying to fit a regression model to a scatterplot, the temptation is to immediately use a line. This is the best-understood regression model, and when you get the hang of it, slopes and intercepts aren’t all that daunting.
But linear regression isn’t the only kind of regression. It’s possible to fit a curve through a scatterplot. Don’t be fooled: The statistical concepts behind curvilinear regression are more difficult to understand than the concepts behind linear regression.
It’s worth taking the time to master those concepts, however. Sometimes, a curve is a much better fit than a line.
Extrapolating beyond a sample scatterplot is a bad idea
Whether you’re working with linear regression or curvilinear regression, keep in mind that it’s inappropriate to generalize beyond the boundaries of the scatterplot.
Suppose you’ve established a solid predictive relationship between a test of mathematics aptitude and performance in mathematics courses, and your scatterplot covers only a narrow range of mathematics aptitude. You have no way of knowing whether the relationship holds up beyond that range. Predictions outside that range aren’t valid.
Your best bet is to expand the scatterplot by testing more people. You might find that the original relationship tells only part of the story.
Careful analysis of residuals (the differences between observed and predicted values) can tell you a lot about how well the line fits the data. A foundational assumption is that variability around a regression line is the same up and down the line. If it isn’t, the model might not be as predictive as you think. If the variability is systematic (greater variability at one end than at the other), curvilinear regression might be more appropriate than linear. The standard error of estimate won’t always be the indicator.
A sample can be too large
Believe it or not, this sometimes happens with correlation coefficients. A very large sample can make a small correlation coefficient statistically significant.
But what does that correlation coefficient really mean? The coefficient of determination —r2 — is just .038, meaning that the SSRegression is less than 4 percent of the SSTotal. That’s a very small association.
Bottom line: When looking at a correlation coefficient, be aware of the sample size. If it’s large enough, it can make a trivial association turn out statistically significant. (Hmmm . . . significance — there it is again!)
Consumers: Know your axes
When you look at a graph, make sure you know what’s on each axis. Make sure you understand the units of measure. Do you understand the independent variable? Do you understand the dependent variable? Can you describe each one in your own words? If the answer to any of these questions is “No,” you don’t understand the graph you’re looking at.
When looking at a graph in a TV ad, be very wary if it disappears too quickly, before you can see what’s on the axes. The advertiser may be trying to create a lingering false impression about a bogus relationship inside the graph. The graphed relationship might be as valid as that other staple of TV advertising — scientific proof via animated cartoon: Tiny animated scrub brushes cleaning cartoon teeth might not necessarily guarantee whiter teeth for you if you buy the product.
Graphing a categorical variable as though it’s a quantitative variable is just wrong
So you’re just about ready to compete in the Rock-Paper-Scissors World Series. In preparation for this international tournament, you’ve tallied all your matches from the past ten years, listing the percentage of times you won when you played each role.
To summarize all the outcomes, use Excel’s graphics capabilities to create a graph.
So many people create these kinds of graphs — people who should know better. The line in the graph implies continuity from one point to another. With these data, of course, that’s impossible. What’s between rock and paper? Why are they equal units apart? Why are the three categories in that order?
Simply put, a line graph is not the proper graph when at least one of your variables is a set of categories. Instead, create a column graph. A pie chart works here, too, because the data are percentages and you have just a few slices.
Whenever appropriate, include variability in your graph
When the points in your graph represent means, make sure that the graph includes the standard error of each mean. This gives the viewer an idea of the variability in the data — which is an important aspect of the data.
Means by themselves don’t always tell you the whole story. Take every opportunity to examine variances and standard deviations. You may find some hidden nuggets. Systematic variation — high values of variance associated with large means, for example — might be a clue about a relationship you didn’t see before.
Be careful when relating statistics textbook concepts to Excel
If you’re serious about doing statistical work, you’ll probably have occasion to look into a statistics text or two. Bear in mind that the symbols in some areas of statistics aren’t standard.
Connecting textbook concepts to Excel’s statistical functions can be a challenge because of the texts and because of Excel. Messages in dialog boxes and in Help files might contain symbols other than the ones you read about, or they might use the same symbols but in a different way. This discrepancy might lead you to make an incorrect entry into a parameter in a dialog box, resulting in an error that’s hard to trace.