How to Calculate Multiple Linear Regression for Six Sigma


What should Six Sigma practitioners do with all the situations where more than one X influences a Y? After all, that kind of situation is more common than a single influencing variable is. When you work to create an equation that includes more than one variable, such as Y = f(X₁, X₂, . . ., X_n), you use multiple linear regression.

The general form of the multiple linear regression model is simply an extension of the simple linear regression model. For example, if you have a system where X₁ and X₂ both contribute to Y, the multiple linear regression model becomes

Y = β₀ + β₁X₁ + β₁₁X₁² + β₂X₂ + β₂₂X₂² + β₁₂X₁X₂ + ε

This equation features five distinct kinds of terms:

β₀: This term is the overall effect, also called the intercept. It sets the starting level for all the other effects, regardless of what the X variables are set at.
β_iX_i: The β₁X₁ and β₂X₂ pieces are the main effects terms in the equation. Just like in the simple linear regression model, these terms capture the linear effect each X_i has on the output Y. The magnitude and direction of each of these effects are captured in the associated β_i coefficients.
β_iiX_i²: β₁₁X₁² and β₂₂X₂² are the second-order or squared effects for each of the Xs. Because the variable is raised to the second power, the effect is quadratic rather than linear. The magnitude and direction of each of these second-order effects are indicated by the associated β_ii coefficients.
β₁₂X₁X₂: This effect is called the interaction effect. This term allows the input variables to have an interactive or combined effect on the outcome Y. Once again, the magnitude and direction of the interaction effect are captured in the β₁₂ coefficient.
ε: This term accounts for all the random variation that the other terms can’t explain. ε is assumed to follow a normal distribution centered at zero.
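To make the terms concrete, here is a minimal Python sketch (using NumPy) that lays out one column per β and evaluates the equation without the ε term. The function name `design_matrix`, the coefficient values, and the X settings are all made up for illustration; they don't come from any real process.

```python
import numpy as np

def design_matrix(x1, x2):
    """Build one column per coefficient: 1, X1, X2, X1^2, X2^2, X1*X2."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.column_stack([
        np.ones_like(x1),  # overall effect (b0)
        x1, x2,            # main effects (b1, b2)
        x1**2, x2**2,      # second-order effects (b11, b22)
        x1 * x2,           # interaction effect (b12)
    ])

# Hypothetical coefficients, in column order: b0, b1, b2, b11, b22, b12
betas = np.array([10.0, 2.0, -1.0, 0.5, 0.25, -0.75])

# Three hypothetical (X1, X2) settings: (0, 0), (1, 3), (2, 1)
A = design_matrix([0.0, 1.0, 2.0], [0.0, 3.0, 1.0])

# Predicted Y for each setting (the random term is left out)
y_hat = A @ betas
print(y_hat)
```

Note that the model stays "linear" because Y is a linear combination of the βs, even though some columns are squares or products of the Xs.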

The equation for multiple linear regression can fit much more than a simple line; it can accommodate curves, three-dimensional surfaces, and even abstract relationships in n-dimensional space! Multiple linear regression can handle about anything you throw at it. The process for performing multiple linear regression follows the same pattern that simple linear regression does:

Gather the data for the Xs and the Y.
Estimate the multiple linear regression coefficients.

When you have more than one X variable, the equations for deriving the βs become very complex and very tedious. You definitely want to use a statistical analysis software tool to calculate these equations automatically for you. The βs just pop right out. Otherwise, go buy a box of number 2 pencils and roll up your sleeves!
Check the residual values to confirm that they meet the upfront assumptions of the multiple linear regression model.

Checking that the residuals are normal is critically important. If the residuals aren’t centered on zero, or if their variation isn’t random and normal, the starting assumptions of the multiple linear regression model haven’t been met, and the model is invalid.
Perform statistical tests to see which terms of the multiple linear regression equation are significant (and should be kept in the model) and which are insignificant (and need to be removed).

Some terms in the multiple regression equation aren’t significant. You find out which ones by performing an F test for each term in the equation. When the variation contribution of an equation term is small compared to the residual variation, that term won’t pass the F test, and you can remove it from the equation.

Your goal is to simplify the regression equation as much as possible while maximizing the R² metric of fit. Generally, simpler is better. So if you find two regression equations that both have the same R² value, you want to settle on the one with the fewest, simplest terms.

Usually, the higher-order terms are the first to go. There’s just less chance of a squared term or an interaction term being statistically significant.
Calculate the final coefficient of determination R² for the multiple linear regression model.

Use the R² metric to quantify how much of the observed variation your final equation explains.
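The steps above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the data is synthetic (generated from a made-up "true" equation with a fixed random seed), the helper `fit` is a hypothetical name, and the per-term test shown is a partial F statistic computed by refitting the model without that term. A statistics package would normally report these values, along with p-values, for you.

```python
import numpy as np

# Step 1: gather data (synthetic here; the "true" equation is an assumption)
rng = np.random.default_rng(0)
n = 60
x1 = rng.uniform(-2, 2, n)
x2 = rng.uniform(-2, 2, n)
y = 5.0 + 3.0 * x1 - 2.0 * x2 + 1.5 * x1 * x2 + rng.normal(0, 0.5, n)

names = ["1", "X1", "X2", "X1^2", "X2^2", "X1*X2"]
A = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2, x1 * x2])

def fit(A, y):
    """Least-squares fit; returns coefficients, residuals, and SSE."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return beta, resid, float(resid @ resid)

# Step 2: estimate the coefficients
beta, resid, sse_full = fit(A, y)

# Step 3: check the residuals (centered on zero, roughly normal spread)
dof = n - A.shape[1]            # residual degrees of freedom
mse = sse_full / dof            # residual variance estimate
resid_mean = resid.mean()
within_2sd = np.mean(np.abs(resid) < 2 * np.sqrt(mse))  # near 0.95 if normal

# Step 4: partial F statistic for each term (drop the term, refit, compare)
f_stats = {}
for j in range(1, A.shape[1]):
    _, _, sse_reduced = fit(np.delete(A, j, axis=1), y)
    f_stats[names[j]] = (sse_reduced - sse_full) / mse

# Step 5: coefficient of determination
sst = float(((y - y.mean()) ** 2).sum())
r2 = 1.0 - sse_full / sst

print(dict(zip(names, beta.round(2))))
print({k: round(v, 1) for k, v in f_stats.items()})
print(round(r2, 3))
```

Each F statistic has 1 and `dof` degrees of freedom, so you compare it against the critical value from an F table at your chosen significance level. In this synthetic example, the terms that were actually used to generate the data should show far larger F values than the squared terms that weren't.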

With good analysis software becoming more accessible, the power of multiple linear regression is available to a growing audience. Many of the more sophisticated statistical analysis software tools even have automated algorithms that search through the various combinations of equation terms while maximizing R².
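As a rough idea of what such an automated search might look like, here is a small Python sketch that tries every combination of candidate terms and keeps the best one. It is not how any particular software package works. One assumption is swapped in on purpose: the sketch scores each candidate with adjusted R² rather than plain R², because plain R² never goes down when you add a term and so would always pick the largest equation. The function names and the synthetic data are hypothetical.

```python
from itertools import combinations

import numpy as np

def adjusted_r2(A, y):
    """Adjusted R^2 for a least-squares fit of y on the columns of A."""
    n, p = A.shape
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    return 1.0 - (sse / (n - p)) / (sst / (n - 1))

def best_subset(terms, y):
    """Try every subset of candidate terms; the intercept is always kept."""
    n = len(y)
    names = list(terms)
    best_score, best_names = -np.inf, ()
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            cols = [np.ones(n)] + [terms[name] for name in subset]
            score = adjusted_r2(np.column_stack(cols), y)
            if score > best_score:
                best_score, best_names = score, subset
    return best_names, best_score

# Synthetic data from a made-up equation (an assumption for illustration)
rng = np.random.default_rng(1)
n = 80
x1 = rng.uniform(-2, 2, n)
x2 = rng.uniform(-2, 2, n)
y = 5.0 + 3.0 * x1 - 2.0 * x2 + 1.5 * x1 * x2 + rng.normal(0, 0.5, n)

terms = {"X1": x1, "X2": x2, "X1^2": x1**2, "X2^2": x2**2, "X1*X2": x1 * x2}
chosen, score = best_subset(terms, y)
print(chosen, round(score, 3))
```

An exhaustive search like this doubles in size with every candidate term, so it is only practical for small equations. Adjusted R² can also let a weak term slip through, which is why the per-term significance tests still matter.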

About This Article

About the book author:

Craig Gygi is Executive VP of Operations at MasterControl, a leading company providing software and services for best practices in automating and connecting every stage of quality/regulatory compliance, through the entire product life cycle. He is an operations executive and internationally recognized Lean Six Sigma thought leader and practitioner.

Bruce Williams is Vice President of Pegasystems, the world leader in business process management. He is a leading speaker and presenter on business and technology trends, and is co-author of Six Sigma Workbook for Dummies, Process Intelligence for Dummies, BPM Basics for Dummies and The Intelligent Guide to Enterprise BPM.

Neil DeCarlo was President of DeCarlo Communications.