One of the most important decisions you make when specifying your econometric model is which variables to include as independent variables. Here, you find out what problems can occur if you include too few or too many independent variables in your model, and you see how this misspecification affects your results.

## Omitting relevant variables

If a variable that belongs in the model is excluded from the estimated regression function, the model is misspecified, and the estimated coefficients may be biased.

You have omitted variable bias if an excluded variable has some effect (positive or negative) on your dependent variable and it’s correlated with at least one of your included independent variables.

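The mathematical nature of specification bias can be expressed using a simple model. Suppose the true population model is given by

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$$

where $X_1$ and $X_2$ are the two variables that affect $Y$. But due to ignorance or lack of data, you instead estimate this regression:

$$Y_i = \alpha_0 + \alpha_1 X_{1i} + u_i$$

which omits $X_2$ from the independent variables. The expected value of $\hat{\alpha}_1$ in this situation is

$$E(\hat{\alpha}_1) = \beta_1 + \beta_2 \delta_1$$

So the OLS estimator loses the unbiasedness that the Gauss-Markov theorem would otherwise guarantee, because

$$E(\hat{\alpha}_1) \neq \beta_1$$

whenever $\beta_2 \delta_1 \neq 0$. The magnitude of the bias can be expressed as

$$E(\hat{\alpha}_1) - \beta_1 = \beta_2 \delta_1$$

where $\beta_2$ is the effect of $X_2$ on $Y$ and $\delta_1$ is the slope from this regression:

$$X_{2i} = \delta_0 + \delta_1 X_{1i} + v_i$$

which captures the correlation (positive or negative) between the included and excluded variable(s).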
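The bias result above can be checked with a small simulation: when X2 is omitted, the estimated slope on X1 should center on the true coefficient plus the effect of X2 on Y times the slope from regressing X2 on X1. The sketch below is illustrative only. The parameter values, sample size, seed, and the helper names `ols_slope` and `simulate_short_regression` are assumptions made for this example, not part of the original discussion.

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def simulate_short_regression(beta1, beta2, delta1, n=50_000, seed=0):
    """Draw data from the true two-variable model, then omit X2.

    Returns (slope on X1 when X2 is omitted, slope from regressing X2 on X1).
    All numeric settings here are hypothetical choices for illustration.
    """
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    # The omitted variable is correlated with X1 through delta1.
    x2 = [delta1 * a + rng.gauss(0, 1) for a in x1]
    # True model: Y depends on both X1 and X2.
    y = [1.0 + beta1 * a + beta2 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
    alpha1_hat = ols_slope(x1, y)   # misspecified regression (X2 left out)
    delta1_hat = ols_slope(x1, x2)  # auxiliary regression of X2 on X1
    return alpha1_hat, delta1_hat


beta1, beta2, delta1 = 2.0, 3.0, 0.5
alpha1_hat, delta1_hat = simulate_short_regression(beta1, beta2, delta1)
print("slope on X1 with X2 omitted:", round(alpha1_hat, 3))
print("true beta1:", beta1)
print("beta1 + beta2 * delta1:", beta1 + beta2 * delta1)
```

With a positive effect of X2 on Y and a positive relationship between X1 and X2, the estimated slope should land above the true coefficient. Flipping the sign of `delta1` should push it below.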

Summary of Omitted Variable Bias

| Impact of omitted variable on dependent variable | Positive correlation between included and omitted variable | Negative correlation between included and omitted variable |
|---|---|---|
| Positive | Positive bias | Negative bias |
| Negative | Negative bias | Positive bias |

In practice, you’re likely to have some omitted variable bias because it’s impossible to control for everything that affects your dependent variable. However, you can reduce it by avoiding simple regression models (with one independent variable) and by including the variables that theory suggests are most important in explaining the dependent variable, whether or not they turn out to be statistically significant.

## Including irrelevant variables

If a variable doesn’t belong in the model and is included in the estimated regression function, the model is overspecified. If you overspecify the regression model by including an irrelevant variable, the estimated coefficients remain unbiased. However, including it has the undesirable effect of increasing the standard errors of your coefficients.
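A small Monte Carlo experiment can illustrate both claims: across repeated samples, the coefficient on X should average out near its true value with or without the irrelevant variable, but its estimates should be more spread out when the irrelevant variable is included. This is a rough sketch under assumed settings. The variable `z`, the correlation of about 0.8 between `x` and `z`, the sample sizes, and the function names are all hypothetical.

```python
import random
import statistics


def slope_on_x(x, y, z=None):
    """OLS coefficient on x from regressing y on x (and optionally z), with an intercept."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if z is None:
        return sxy / sxx
    mz = sum(z) / n
    szz = sum((c - mz) ** 2 for c in z)
    sxz = sum((a - mx) * (c - mz) for a, c in zip(x, z))
    szy = sum((c - mz) * (b - my) for c, b in zip(z, y))
    # Closed-form solution of the two-regressor normal equations.
    return (szz * sxy - sxz * szy) / (sxx * szz - sxz ** 2)


def monte_carlo(reps=500, n=200, true_slope=2.0, seed=42):
    """Return the X slopes estimated without and with the irrelevant variable z."""
    rng = random.Random(seed)
    without_z, with_z = [], []
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        # z is correlated with x but does not appear in the equation for y.
        z = [0.8 * a + 0.6 * rng.gauss(0, 1) for a in x]
        y = [1.0 + true_slope * a + rng.gauss(0, 1) for a in x]
        without_z.append(slope_on_x(x, y))
        with_z.append(slope_on_x(x, y, z))
    return without_z, with_z


without_z, with_z = monte_carlo()
print("mean slope, z excluded:", round(statistics.mean(without_z), 3))
print("mean slope, z included:", round(statistics.mean(with_z), 3))
print("spread of slope, z excluded:", round(statistics.stdev(without_z), 3))
print("spread of slope, z included:", round(statistics.stdev(with_z), 3))
```

The spread across replications is the simulation counterpart of the standard error, so a larger spread with `z` included reflects the loss of precision from overspecifying the model.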

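In a simple regression model (with one independent variable), the estimated standard error of the regression coefficient for $X$ is

$$se(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}_\varepsilon^2}{\sum_i (X_i - \bar{X})^2}}$$

where $\hat{\sigma}_\varepsilon^2$ is the estimated variance of the error and $\sum_i (X_i - \bar{X})^2$ is the total variation in $X$.

If you include additional independent variables in the model, the estimated standard error for any given regression coefficient is given by

$$se(\hat{\beta}_k) = \sqrt{\frac{\hat{\sigma}_\varepsilon^2}{\sum_i (X_{ki} - \bar{X}_k)^2 \, (1 - R_k^2)}}$$

where $R_k^2$ is the R-squared from the regression of $X_k$ on the other independent variables. Because $0 \le R_k^2 \le 1$, the factor $(1 - R_k^2)$ is at most 1, so the denominator decreases whenever the added variable is correlated with $X_k$. An irrelevant variable doesn’t help explain any of the variation in $Y$, so without an offsetting decrease in $\hat{\sigma}_\varepsilon^2$ the standard error increases.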
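To make the comparison concrete, the sketch below computes the standard error of the coefficient on X twice from one simulated sample: once from a simple regression, and once after adding an irrelevant regressor. In the second case the total variation in X is scaled by one minus the R-squared from regressing X on the added variable. The data-generating values, the degrees-of-freedom correction (sample size minus the number of estimated coefficients), and the function names are assumptions for illustration.

```python
import math
import random


def se_simple(x, y):
    """Standard error of the slope when y is regressed on x alone."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    # Estimated error variance: residual sum of squares over n - 2 (assumed df).
    sigma2_hat = (syy - slope * sxy) / (n - 2)
    return math.sqrt(sigma2_hat / sxx)


def se_with_extra_regressor(x, z, y):
    """Return (standard error of the x slope, R-squared of x on z) for y regressed on x and z."""
    n = len(y)
    mx, mz, my = sum(x) / n, sum(z) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((c - mz) ** 2 for c in z)
    sxz = sum((a - mx) * (c - mz) for a, c in zip(x, z))
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    szy = sum((c - mz) * (b - my) for c, b in zip(z, y))
    syy = sum((b - my) ** 2 for b in y)
    det = sxx * szz - sxz ** 2
    bx = (szz * sxy - sxz * szy) / det
    bz = (sxx * szy - sxz * sxy) / det
    # Estimated error variance: residual sum of squares over n - 3 (assumed df).
    sigma2_hat = (syy - bx * sxy - bz * szy) / (n - 3)
    r2_x = sxz ** 2 / (sxx * szz)  # R-squared from regressing x on z
    return math.sqrt(sigma2_hat / (sxx * (1.0 - r2_x))), r2_x


rng = random.Random(1)
n = 5_000
x = [rng.gauss(0, 1) for _ in range(n)]
z = [0.8 * a + 0.6 * rng.gauss(0, 1) for a in x]  # irrelevant for y, correlated with x
y = [1.0 + 2.0 * a + rng.gauss(0, 1) for a in x]

se_short = se_simple(x, y)
se_long, r2_x = se_with_extra_regressor(x, z, y)
print("standard error, z excluded:", round(se_short, 4))
print("standard error, z included:", round(se_long, 4))
print("R-squared of x on z:", round(r2_x, 3))
```

Because `z` explains essentially none of the variation in `y`, the estimated error variance barely changes, and the increase in the standard error comes almost entirely from the R-squared term.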

Just because your estimated coefficient isn’t statistically significant doesn’t make it irrelevant. A well-specified model usually includes some variables that are statistically significant and some that aren’t. Additionally, variables that aren’t statistically significant can contribute enough explained variation to have no detrimental impact on the standard errors.