# Using Linear Regression to Predict an Outcome

Statistical researchers often use a linear relationship to predict the (average) numerical value of *Y* for a given value of *X* using a straight line (called the *regression line*). If you know the slope and the *y*-intercept of that regression line, then you can plug in a value for *X* and predict the average value for *Y*. In other words, you predict (the average) *Y* from *X*.

If you establish at least a moderate correlation between *X* and *Y* through both a correlation coefficient and a scatterplot, then you know they have some type of linear relationship.

Never do a regression analysis unless you have already found at least a moderately strong correlation between the two variables. (A good rule of thumb is that the correlation should be at or beyond either +0.50 or –0.50.) If the data don’t resemble a line to begin with, you shouldn’t try to use a line to fit the data and make predictions (but people still try).
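As a quick sketch of that rule of thumb, you can compute the correlation coefficient with NumPy and check it against the ±0.50 cutoff before even considering a regression line (the data values here are invented for illustration):

```python
import numpy as np

# Hypothetical paired measurements, invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is Pearson's r for the pair of variables.
r = np.corrcoef(x, y)[0, 1]

# Rule of thumb: only consider fitting a line if |r| >= 0.50.
if abs(r) >= 0.50:
    print(f"r = {r:.3f}: strong enough to consider a regression line")
else:
    print(f"r = {r:.3f}: too weak -- don't fit a line")
```

Remember that passing this check alone isn’t enough; you still need to look at the scatterplot.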

Before moving forward to find the equation for your regression line, you have to identify which of your two variables is *X* and which is *Y*. When doing correlations, the choice of which variable is *X* and which is *Y* doesn’t matter, as long as you’re consistent for all the data. But when fitting lines and making predictions, the choice of *X* and *Y* does make a difference.

So how do you determine which variable is which? In general, *Y* is the variable that you want to predict, and *X* is the variable you are using to make that prediction. For example, say you are using the number of times a population of crickets chirp to predict the temperature. In this case you would make the variable *Y* the temperature, and the variable *X* the number of chirps. Hence *Y* can be predicted by *X* using the equation of a line if a strong enough linear relationship exists.
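To make the cricket example concrete, here is a minimal sketch of fitting the regression line and predicting temperature from chirps. The chirp counts and temperatures below are made-up numbers, not real measurements:

```python
import numpy as np

# Hypothetical data: chirps per minute (X) and temperature in F (Y).
chirps = np.array([88.0, 93.0, 100.0, 109.0, 113.0, 120.0])
temp_f = np.array([68.0, 70.0, 73.0, 76.0, 78.0, 81.0])

# np.polyfit with degree 1 returns (slope, intercept) of the
# least-squares regression line: Y = slope * X + intercept.
slope, intercept = np.polyfit(chirps, temp_f, 1)

# Plug in a new chirp count to predict the average temperature.
predicted = slope * 105 + intercept
print(f"Predicted average temperature at 105 chirps/min: {predicted:.1f} F")
```

Note the asymmetry: swapping `chirps` and `temp_f` in the call to `np.polyfit` would give a different line, which is exactly why the choice of *X* and *Y* matters for prediction.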

Statisticians call the *X*-variable (cricket chirps in this example) the *explanatory variable,* because if *X* changes, the slope tells you (or explains) how much *Y* is expected to change in response. Therefore, the *Y* variable is called the *response variable.* Other names for *X* and *Y* include the *independent* and *dependent* variables, respectively.

In the case of two numerical variables, you can come up with a line that enables you to predict *Y* from *X*, if (and only if) the following two conditions are met:

- The scatterplot must form a linear pattern.
- The correlation, *r*, is moderate to strong (typically beyond 0.50 or –0.50).

Some researchers don’t check these conditions before making predictions; their claims are not valid unless both conditions are met.

But suppose the correlation is high; do you still need to look at the scatterplot? Yes. In some situations the data have a somewhat curved shape, yet the correlation is still strong; in these cases making predictions using a straight line is still invalid. Predictions in these cases need to be made based on other methods that use a curve instead.
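A small sketch makes this pitfall easy to see: data that follow a clearly curved pattern can still produce a very high correlation coefficient, so the number alone can’t tell you a straight line is appropriate. (The quadratic data below are constructed purely to demonstrate the point.)

```python
import numpy as np

# Hypothetical curved data: Y grows quadratically with X, yet the
# Pearson correlation is still very strong -- which is exactly why a
# high r by itself doesn't justify fitting a straight line.
x = np.arange(1, 11, dtype=float)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}, even though the relationship is clearly curved")
```

A scatterplot of `x` versus `y` would reveal the curve immediately, which is why the plot check always accompanies the correlation check.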