Model Building with Stepwise Regression

One of the reasons (but not the only reason) for running a multiple regression analysis is to come up with a prediction formula for some outcome variable, based on a set of available predictor variables. Ideally, you’d like this formula to be parsimonious — to have as few variables as possible, but still make good predictions.

So, how do you select, from among a big bunch of predictor variables, the smallest subset needed to make a good prediction model? This is called the “model building” problem, which is a topic of active research by theoretical statisticians. No single method has emerged as the best way to select which variables to include. Unfortunately, researchers often use informal methods that seem reasonable but really aren’t very good, such as the following:

Do a big multiple regression using all available predictors, and then drop the ones that didn’t come out significant. This approach may miss some important predictors because of collinearity: when two predictors are strongly correlated with each other, each one can look non-significant in the full model even though either one alone predicts the outcome well.
Run univariate regressions on every possible predictor individually, and then select only those predictors that were significant (or nearly significant) on the univariate tests. But sometimes a truly important predictor variable isn’t significantly associated with the outcome when tested by itself, but only when the effects of some other variable have been compensated for. This problem is the reverse of the disappearing significance problem (in which a predictor that is significant by itself loses its significance when a correlated predictor joins the model). It’s not nearly as common, but it can happen.
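
Both pitfalls are easy to reproduce on made-up data. The following is a minimal sketch, assuming NumPy is available; the simulated variables (a, b, signal, nuisance, measured) and the helper t_statistics are hypothetical names invented for this illustration, and a |t| above roughly 2 is used as a loose stand-in for "significant."

```python
import numpy as np

def t_statistics(predictors, y):
    """Return the OLS t-statistic of each predictor (an intercept is fitted but not reported)."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + list(predictors))
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    sigma2 = residuals @ residuals / (n - design.shape[1])  # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)         # covariance of the coefficients
    return (beta / np.sqrt(np.diag(cov)))[1:]

rng = np.random.default_rng(1)  # fixed seed so the simulation is repeatable
n = 200

# Pitfall 1 (collinearity): b is almost a copy of a, and both drive the outcome.
a = rng.standard_normal(n)
b = a + 0.05 * rng.standard_normal(n)
y1 = a + b + rng.standard_normal(n)
t_alone_a = t_statistics([a], y1)[0]     # a tested by itself
t_joint_a = t_statistics([a, b], y1)[0]  # a tested alongside its near-copy b

# Pitfall 2 (univariate screening): nuisance is unrelated to the outcome by itself,
# but it contaminates the measured predictor, so adjusting for it matters.
signal = rng.standard_normal(n)
nuisance = rng.standard_normal(n)
measured = signal + nuisance
y2 = signal + 0.5 * rng.standard_normal(n)
t_alone_nuisance = t_statistics([nuisance], y2)[0]
t_joint_nuisance = t_statistics([measured, nuisance], y2)[1]

print("a alone:", round(t_alone_a, 1), " a with b:", round(t_joint_a, 1))
print("nuisance alone:", round(t_alone_nuisance, 1),
      " nuisance with measured:", round(t_joint_nuisance, 1))
```

In the first case, the t-statistic for a shrinks dramatically once b is in the model, so a "drop what isn’t significant" rule could throw away a real predictor. In the second case, nuisance looks unimportant in a univariate test but becomes strongly significant once measured is in the model, so a univariate screen would never let it in.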

There is another way — many statistics packages offer stepwise regression, in which you provide all the available predictor variables, and the program then goes through a process similar to what a human (with a logical mind and a lot of time on his hands) might do to identify the best subset of those predictors. The program very systematically tries adding and removing the various predictors from the model, one at a time, looking to see which predictors, when added to a model, substantially improve its predictive ability, or when removed from the model, make it substantially worse.
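
To make the add-and-remove process concrete, here is a minimal sketch of one possible stepwise procedure, assuming NumPy is available. It is not the algorithm of any particular statistics package: the function names aic and stepwise, the optional forced argument (predictors that are always kept in the model), and the simulated data are all assumptions made for this illustration, and the Akaike information criterion (AIC) is just one of several criteria you could use to judge whether a model got better.

```python
import numpy as np

def aic(X, y, cols):
    """AIC (up to an additive constant) of a least-squares fit on an intercept plus the given columns."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + 2 * design.shape[1]

def stepwise(X, y, forced=()):
    """Add or remove one predictor at a time until no single change lowers the AIC."""
    forced = list(forced)
    selected = list(forced)          # start from the forced predictors (possibly none)
    best = aic(X, y, selected)
    while True:
        moves = []
        for c in range(X.shape[1]):
            if c not in selected:
                moves.append(selected + [c])                    # try adding c
            elif c not in forced:
                moves.append([s for s in selected if s != c])   # try removing c
        if not moves:
            return sorted(selected)
        scores = [aic(X, y, m) for m in moves]
        i = int(np.argmin(scores))
        if scores[i] >= best:        # no single change improves the model: stop
            return sorted(selected)
        best, selected = scores[i], moves[i]

# Simulated data (an assumption for this sketch): columns 0 and 2 truly drive the
# outcome, column 1 is a noisy copy of column 0, and columns 3 and 4 are pure noise.
rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 5))
X[:, 1] = X[:, 0] + 0.5 * rng.standard_normal(n)
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.standard_normal(n)

chosen = stepwise(X, y)
chosen_forced = stepwise(X, y, forced=[4])   # column 4 stays in no matter what
print("selected columns:", chosen)
print("selected columns with column 4 forced in:", chosen_forced)
```

With data like this, the two real predictors end up in the model, while the redundant copy and the noise columns are usually (but not always) left out; a noise variable can slip in by chance, which is one reason to treat any automatically selected model with some caution.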

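Stepwise regression can use several different algorithms (forward selection, backward elimination, or a combination of the two), and models can be judged to be better or worse by several different criteria, such as the p value of each added or removed term, the Akaike information criterion (AIC), or the adjusted R-squared. These methods often do a decent job of the following: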

Detecting and dropping variables that aren’t associated with the outcome, either in univariate or multiple regression
Detecting and dropping redundant variables (predictors that are strongly associated with even better predictors of the outcome)
Detecting and including variables that may not have been significant in univariate regression but that are significant when you adjust for the effects of other variables

Most stepwise regression software also lets you “force” certain variables into the model, if you know (for example, from physiological evidence) that these variables are important predictors of the outcome.

About This Article

About the book author:

John C. Pezzullo, PhD, has held faculty appointments in the departments of biomathematics and biostatistics, pharmacology, nursing, and internal medicine at Georgetown University. He is semi-retired and continues to teach biostatistics and clinical trial design online to Georgetown University students.