How to Utilize Linear Regressions in Predictive Analytics

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

Linear regression is a statistical method that models the relationship between variables (in its simplest form, between two) by fitting a straight line to the data. In predictive analytics, you can use it to predict a future numerical value of a variable.

Consider an example of data that contains two variables: past records of a train’s arrival times and the corresponding delay times. Suppose you want to predict the delay of the next train. If you apply linear regression to these two variables (arrival time and delay time), you can generate a linear equation such as

Delay = a + (b * Arrival time) + d

This equation expresses the relationship between delay time and arrival time. The constants a and b are the model’s parameters: a is the intercept and b is the slope. The variable d is the error term (also known as the residual), a numerical value that represents the mismatch between the observed delay and the delay that the line a + (b * Arrival time) predicts. If the error is not equal to zero, that might indicate that factors other than arrival time affect the delay.
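To make the error term concrete, here is a minimal sketch in plain Python that fits a and b by ordinary least squares and then computes d for each past observation. The data values, the units (hours after midnight, minutes of delay), and the helper name fit_line are assumptions made up for illustration; they do not come from a real timetable.

```python
# Hypothetical past observations, invented for illustration only.
arrival_hours = [6, 8, 10, 12, 14, 16, 18]           # arrival time, hours after midnight
delay_minutes = [2.0, 3.5, 3.0, 5.0, 6.5, 6.0, 8.0]  # observed delay, minutes


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


a, b = fit_line(arrival_hours, delay_minutes)

# Error term d for each observation: observed delay minus fitted delay.
residuals = [yi - (a + b * xi) for xi, yi in zip(arrival_hours, delay_minutes)]

print(f"a = {a:.3f}, b = {b:.3f}")
for xi, d in zip(arrival_hours, residuals):
    print(f"arrival {xi:>2}:00  error d = {d:+.2f} minutes")
```

With an intercept in the model, the least-squares errors sum to (approximately) zero, so individual trains can sit above or below the line even though the line is the best overall fit.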

If you’re sitting at the train station, you can plug the arrival time into the preceding equation to compute the expected delay, using the linear regression model’s fitted parameters a and b. The error term d is not known in advance for a new train, so the prediction itself is a + (b * Arrival time).
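As a rough sketch of that plug-in step, the function below evaluates the fitted line for one arrival time. The parameter values a = 1.5 and b = 0.5, the function name expected_delay, and the units are hypothetical placeholders, not results from a real model; the sketch leaves the error term out because it is not known ahead of time for a train that has not arrived yet.

```python
def expected_delay(arrival_hour, a, b):
    """Expected delay (minutes) from the fitted line: a + b * arrival time."""
    return a + b * arrival_hour


# Hypothetical fitted parameters, chosen only to illustrate the arithmetic.
a = 1.5   # intercept, in minutes
b = 0.5   # slope, in minutes of delay per hour of arrival time

next_arrival = 17.5  # 17:30 expressed as hours after midnight
prediction = expected_delay(next_arrival, a, b)
print(prediction)  # prints 10.25
```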

Linear regression is (as you might imagine) most suitable for data that follows a roughly linear trend. But it’s very sensitive to outliers: a few extreme data points can have a significant impact on the fitted model. If you’re planning to use linear regression for your predictive model, it is recommended that you identify those outliers and remove them from the training set.
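The sensitivity to outliers can be illustrated with a small experiment: fit the line once on all observations and once after dropping a single extreme point. Everything here is an assumption for illustration, including the made-up data, the helper name fit_line, and the 30-minute cutoff used to flag the outlier; in practice the rule for flagging outliers depends on your data.

```python
def fit_line(points):
    """Ordinary least squares fit of y = a + b * x over (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical (arrival hour, delay in minutes) pairs; the last one is an
# unusually long delay that acts as an outlier.
observations = [
    (6, 2.0), (8, 3.5), (10, 3.0), (12, 5.0),
    (14, 6.5), (16, 6.0), (18, 8.0), (20, 60.0),
]

# Assumed rule for this sketch: treat delays above 30 minutes as outliers.
OUTLIER_CUTOFF_MINUTES = 30.0
cleaned = [(x, y) for x, y in observations if y <= OUTLIER_CUTOFF_MINUTES]

a_all, b_all = fit_line(observations)
a_clean, b_clean = fit_line(cleaned)

print(f"with outlier:    a = {a_all:.2f}, b = {b_all:.2f}")
print(f"without outlier: a = {a_clean:.2f}, b = {b_clean:.2f}")
```

In this made-up data set a single extreme delay makes the fitted slope several times steeper, which is why the training set deserves a check before you fit the model.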