How to Calculate Multiple Linear Regression for Six Sigma
What should Six Sigma practitioners do in all the situations where more than one X influences a Y? When you work to create an equation that includes more than one variable, such as Y = f(X1, X2, . . ., Xn), you use multiple linear regression. After all, that kind of situation is more common than one with a single influencing variable.
The general form of the multiple linear regression model is simply an extension of the simple linear regression model. For example, if you have a system where X1 and X2 both contribute to Y, the multiple linear regression model becomes
Yi = β0 + β1X1 + β11X1² + β2X2 + β22X2² + β12X1X2 + ε
This equation features five distinct kinds of terms:
β0: This term is the overall effect. It sets the starting level for all the other effects, regardless of what the X variables are set at.
βiXi: The β1X1 and β2X2 pieces are the main effects terms in the equation. Just like in the simple linear regression model, these terms capture the linear effect each Xi has on the output Y. The magnitude and direction of each of these effects are captured in the associated βi coefficients.
βiiXi²: β11X1² and β22X2² are the second-order or squared effects for each of the Xs. Because the variable is raised to the second power, the effect is quadratic rather than linear. The magnitude and direction of each of these second-order effects are indicated by the associated βii coefficients.
β12X1X2: This effect is called the interaction effect. This term allows the input variables to have an interactive or combined effect on the outcome Y. Once again, the magnitude and direction of the interaction effect are captured in the β12 coefficient.
ε: This term accounts for all the random variation that the other terms can’t explain. ε is assumed to follow a normal distribution centered at zero.
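To make the five kinds of terms concrete, here is a minimal Python sketch that evaluates the two-variable model for a single observation. The function names (quadratic_terms, predict), the coefficient values, and the noise standard deviation of 1 are all hypothetical choices made for illustration; none of them come from a real data set.

```python
import random

def quadratic_terms(x1, x2):
    """Return the term values [1, X1, X1^2, X2, X2^2, X1*X2] for one observation."""
    return [1.0, x1, x1 ** 2, x2, x2 ** 2, x1 * x2]

def predict(betas, x1, x2):
    """Deterministic part of the model: the sum of each coefficient times its term."""
    return sum(b * t for b, t in zip(betas, quadratic_terms(x1, x2)))

# Hypothetical coefficients, in the order b0, b1, b11, b2, b22, b12.
betas = [10.0, 2.0, 0.5, -3.0, 0.0, 1.5]

# Expected value of Y at X1 = 2, X2 = 1 (no random variation yet).
y_mean = predict(betas, x1=2.0, x2=1.0)

# One simulated observation: add epsilon drawn from a normal distribution
# centered at zero (a standard deviation of 1.0 is assumed here).
rng = random.Random(0)
y_observed = y_mean + rng.gauss(0.0, 1.0)

print("expected Y:", y_mean)
print("observed Y:", y_observed)
```

Setting a coefficient to zero, as with the hypothetical b22 here, simply removes that term's contribution, which is what dropping a term from the equation amounts to.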
The equation for multiple linear regression can fit much more than a simple line; it can accommodate curves, three-dimensional surfaces, and even abstract relationships in n-dimensional space! Because the model only has to be linear in the β coefficients, not in the Xs themselves, it can handle a wide range of relationships. The process for performing multiple linear regression follows the same pattern that simple linear regression does:
1. Gather the data for the Xs and the Y.
2. Estimate the multiple linear regression coefficients.
   When you have more than one X variable, the equations for deriving the βs become very complex and very tedious. You definitely want to use a statistical analysis software tool to solve these equations automatically for you. The βs just pop right out. Otherwise, go buy a box of number 2 pencils and roll up your sleeves!
3. Check the residual values to confirm that they meet the upfront assumptions of the multiple linear regression model.
   Checking that the residuals are normal is critically important. If the residuals aren’t centered on zero, or if their variation isn’t random and normal, the starting assumptions of the multiple linear regression model haven’t been met, and the model is invalid.
4. Perform statistical tests to see which terms of the multiple linear regression equation are significant (and should be kept in the model) and which are insignificant (and need to be removed).
   Some terms in the multiple regression equation may not be significant. You find out which ones by performing an F test for each term in the equation. When the variation contribution of an equation term is small compared to the residual variation, that term won’t pass the F test, and you can remove it from the equation.
   Your goal is to simplify the regression equation as much as possible while maximizing the R² metric of fit. Generally, simpler is better. So if you find two regression equations that both have the same R² value, you want to settle on the one with the fewest, simplest terms.
   Usually, the higher-order terms are the first to go. There’s just less chance of a squared term or an interaction term being statistically significant.
5. Calculate the final coefficient of determination R² for the multiple linear regression model.
   Use the R² metric to quantify how much of the observed variation your final equation explains.
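The steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the data are simulated from a made-up equation, the helper names (terms, solve, fit, sse) are hypothetical, and the coefficients are estimated by solving the least-squares normal equations directly, which is one common approach and not necessarily what any particular software package does internally. The F statistic for each term compares the fit with and without that term; turning it into a pass/fail decision requires the F distribution, which statistical software supplies.

```python
import random

def terms(x1, x2):
    """Term values for one observation: 1, X1, X1^2, X2, X2^2, X1*X2."""
    return [1.0, x1, x1 * x1, x2, x2 * x2, x1 * x2]

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def sse(rows, y, betas):
    """Sum of squared residuals for a fitted model."""
    return sum((yi - sum(b * t for b, t in zip(betas, r))) ** 2
               for r, yi in zip(rows, y))

# Step 1: gather the data. Simulated here from an assumed equation with
# main effects and an interaction, but no squared effects.
rng = random.Random(1)
n = 60
xs = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(n)]
X = [terms(x1, x2) for x1, x2 in xs]
y = [5 + 2 * x1 - 3 * x2 + 1.5 * x1 * x2 + rng.gauss(0, 0.5) for x1, x2 in xs]

# Step 2: estimate the coefficients of the full model.
names = ["b0", "X1", "X1^2", "X2", "X2^2", "X1*X2"]
k = len(names)
betas = fit(X, y)

# Step 3: compute the residuals and confirm they are centered on zero.
# (A histogram or normal probability plot would be the usual normality check.)
resid = [yi - sum(b * t for b, t in zip(betas, r)) for r, yi in zip(X, y)]
mean_resid = sum(resid) / n

# Step 4: F statistic for each term = extra variation explained by the term,
# divided by the residual variance of the full model.
sse_full = sum(e * e for e in resid)
f_stats = {}
for j in range(1, k):
    reduced = [r[:j] + r[j + 1:] for r in X]  # same data without term j
    sse_reduced = sse(reduced, y, fit(reduced, y))
    f_stats[names[j]] = (sse_reduced - sse_full) / (sse_full / (n - k))

# Step 5: coefficient of determination for the fitted model.
y_bar = sum(y) / n
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - sse_full / sst

print("coefficients:", dict(zip(names, betas)))
print("mean residual:", mean_resid)
print("F statistics:", f_stats)
print("R squared:", r_squared)
```

With 1 and n - k = 54 degrees of freedom, the 5 percent critical value of the F distribution is roughly 4.0, so terms whose F statistic falls well below that are candidates for removal. After dropping a term, the reduced equation should be refit and rechecked before the final R² is reported.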
With good analysis software becoming more accessible, the power of multiple linear regression is available to a growing audience. Many of the more sophisticated statistical analysis tools even have automated algorithms that search through the various combinations of equation terms while maximizing R².