|
Correlation is a fundamental part of forecasting. You can do forecasting without knowing the first thing about correlation, but you handicap yourself if you don't bother. Correlations are key to understanding regression forecasts, and they play an important part in diagnosing how well your Excel sales forecasts work. Better yet, they're not really tough to understand.
When did they start going together?
You want to get your virtual hands on two variables. Begin by assuming those variables are people's height and weight. Now, you know just from general life experience that the taller a person is, the more the person tends to weigh. It's not anything like one for one — people don't automatically weigh 2-1/2 pounds more for every additional inch taller they are. But there is a strong tendency for height and weight to go together. Like teenagers spending time together, it's not easy to keep them apart.
Suppose you decide to do some research. You stand on a street corner in the business district of a large metropolitan area and stop people who are walking by. You engage each person in conversation and while you've got him distracted, you get his height with a tape measure. Then you ask him how much he weighs. Some tell you, and you note down the height and weight. (Some of them ignore the question and walk away, and that's why you have some missing data.)
After you collect data on about 50 people, you head back to your office and put the data in an Excel worksheet. It looks like the one in Figure 1.
Figure 1: You can't tell just by looking at the numbers that there's a fairly strong relationship between height and weight.
So far, it's just a jumble of numbers. When a numeric jumble gets in your face, the first thing to do is chart it, as seen in Figure 2.
Each point in the chart represents a different person you conned into letting you get a height and weight measurement. If you pick out one point and look over to the vertical axis, you can see what that person's height is. And if you look down from that point to the horizontal axis, you can see what the weight is.
Figure 2: Vertical gridlines can clutter up the chart, but some find it useful to add them by choosing Chart --> Chart Options.
Now the jumble is starting to resolve into some patterns:
- The points on the chart that are higher up the vertical axis also tend to be farther along the horizontal axis.
- The points describe a sort of cigar shape, running from the lower left to the upper right.
- The points do not lie directly on a straight line, but you can imagine one running through the middle of the cigar, as in Figure 3. Or you can draw it: Click the charted data series and choose Chart --> Add Trendline.
You can make some statements about the relationship between two variables:
- The closer those points come to lying on the imaginary straight line, the stronger the relationship between the two variables.
- You can express the strength of the relationship with a number. It turns out that, because of the way it's calculated, the number must be between –1 and +1. That number is called the correlation coefficient.
- If the correlation coefficient is positive, like 0.6, then the imaginary straight line runs from the lower left to the upper right. If the correlation coefficient is negative, then the line runs from the upper left to the lower right.
Figure 3: Lower left to upper right means a positive, or direct, correlation.
- The closer the correlation coefficient is to +1.0 or –1.0, the stronger the relationship. The closer it's to zero, the weaker the relationship.
Suppose you analyzed the golf scores of 100 golfers, whose skill levels range from beginner to expert. You could put them on a chart, just like height and weight, with, say, their scores on the vertical axis and their years of golfing experience on the horizontal axis (see Figure 4).
So you see that the more the years of experience playing golf, the lower the golf score. The fact that the correlation is negative has nothing to do with the strength of the relationship — just with its direction.
Using Excel, you can calculate the correlation coefficient between two variables very easily. Just use the CORREL function, with two ranges of data values as its arguments. In Figure 4, the formula used in cell D2 is:
=CORREL(A2:A30,B2:B30)
Figure 4: Upper left to lower right means a negative, or inverse, correlation.
Charting correlated data
When you use Excel to chart correlated data, it's almost always best to use an XY (Scatter) chart.
 | For simplicity, this is referred to as an XY chart. Excel uses the term Scatter in the name because earlier applications — and this goes back to mainframe days — referred to this kind of chart as a scatter chart or scattergram. |
The main reason that an XY chart is best is that, if you're working with correlations, you're automatically working with variables that are entirely numeric. Height and weight are both numeric variables. Golf score and years of experience are both numeric variables.
When you chart them, you want a chart that has a numeric horizontal (or X) axis, and a numeric vertical (or Y) axis. Other chart types don't offer this arrangement. For example, a Column chart assumes that categories (such as make of car or political affiliation) are on its horizontal axis, and numbers (such as number of cars sold or number of registered voters) are on its vertical axis.
The problem with putting numeric values on a category axis is that Excel doesn't reflect the magnitude of the difference between numbers in their spacing on the axis. So, the numbers 2, 4, and 8 would show up with equal distances between them on a category axis. But the difference between 2 and 4 is 2, and the difference between 4 and 8 is 4. Only a numeric axis can represent those differences accurately, and only an XY chart has two numeric axes to handle your two numeric variables.
There are other reasons to use an XY chart for two numeric variables. Among them:
- Trendlines are calculated and drawn accurately.
- R-squared values and regression equations are based on the proper values.
- You get some play between the vertical axis and the leftmost data point, and between the horizontal axis and the bottommost data point. This improves your visualization of the relationship between the two variables.
|