How to Use Curve Fitting in Predictive Analytics
Curve fitting is a process used in predictive analytics to find the mathematical function whose curve best fits the actual (original) data points in a data series.
The curve can either pass through every data point or stay within the bulk of the data, ignoring some data points in hopes of drawing trends from the data. In either case, one single mathematical function is assigned to the entire body of data, with the goal of fitting all data points into a curve that delineates trends and aids prediction.
Curve fitting can be achieved in one of three ways:
By finding an exact fit for every data point (a process called interpolation)
By staying within the bulk of the data while ignoring some of the data points in hopes of drawing trends out of the data
By employing data smoothing to come up with a function that represents the smoothed graph
Curve fitting can be used to fill in possible data points to replace missing values or help analysts visualize the data.
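The three approaches above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the data series is invented (a straight-line trend plus random noise), and the polynomial degrees and the 3-point moving-average window are arbitrary choices, not recommendations.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical data series: a linear trend plus noise (made up for illustration).
rng = np.random.default_rng(0)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# 1) Exact fit (interpolation): a degree n-1 polynomial passes through all n points.
exact = Polynomial.fit(x, y, deg=x.size - 1)

# 2) Trend fit: a least-squares straight line stays within the bulk of the data.
trend = Polynomial.fit(x, y, deg=1)
intercept, slope = trend.convert().coef  # coefficients on the original x scale

# 3) Smoothing first: 3-point moving average, then a line fitted to the smoothed series.
window = np.ones(3) / 3
y_smooth = np.convolve(y, window, mode="valid")  # drops one point at each end
x_smooth = x[1:-1]
smoothed_trend = Polynomial.fit(x_smooth, y_smooth, deg=1)

# Filling in a possible value where no observation exists (x = 4.5 is hypothetical).
filled = trend(4.5)
print(f"trend line: y = {slope:.2f} * x + {intercept:.2f}")
print(f"estimated value at x = 4.5: {filled:.2f}")
```

The exact fit reproduces every observed point, noise included, while the trend line and the smoothed fit deliberately ignore some of the point-to-point variation.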
When you’re working to generate a predictive analytics model, avoid tailoring your model to fit your data sample perfectly. Such a model will fail — miserably — to predict similar yet varying datasets outside the data sample. Fitting a model too closely to a particular data sample is a classic mistake called overfitting.
The woes of overfitting
In essence, overfitting a model is what happens when you overtrain the model to represent only your sample data — which isn’t a good representation of the data as a whole. Without a more realistic dataset to go on, the model can then be plagued with errors and risks when it goes operational — and the consequences to your business can be serious.
Overfitting a model is a common trap because people want to create models that work, and so are tempted to keep tweaking variables and parameters until the model performs perfectly on too little data. To err is human. Fortunately, it’s also human to create realistic solutions.
To avoid overfitting your model to your sample dataset, be sure to have a body of test data available that’s separate from your sample data. Then you can measure the performance of your model independently before making the model operational.
Thus one general safeguard against overfitting is to divide your data into two parts: training data and test data. The model’s performance against the test data will tell you a lot about whether the model is ready for the real world.
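A minimal sketch of that safeguard follows, assuming an invented dataset (a straight line plus noise) and a 50/50 split chosen only to keep the example small. It compares a simple straight-line model with a degree-7 polynomial that is forced through all eight training points. The helper name mse is hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

# Hypothetical dataset: a straight-line trend with noise (invented for illustration).
x = np.linspace(0.0, 10.0, 16)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=x.size)

# Shuffle the indices, then hold out half of the points as test data.
order = rng.permutation(x.size)
train_idx, test_idx = order[:8], order[8:]
x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]

def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return float(np.mean((model(xs) - ys) ** 2))

# A simple model, and one tailored to fit the training sample perfectly
# (8 training points, degree 7: the curve passes through every one of them).
simple = Polynomial.fit(x_train, y_train, deg=1)
overfit = Polynomial.fit(x_train, y_train, deg=7)

for name, model in [("simple", simple), ("overfit", overfit)]:
    print(f"{name}: train MSE = {mse(model, x_train, y_train):.3f}, "
          f"test MSE = {mse(model, x_test, y_test):.3f}")
```

The degree-7 model has essentially zero error on the training data, so only its error on the held-out test data reveals how it is likely to behave on data it has not seen.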
Another best practice is to make sure that your data represents the larger population of the domain you’re modeling. All an overtrained model knows is the specific features of the sample dataset it was trained on. If you train the model only on (say) snowshoe sales in winter, don’t be surprised if it fails miserably when it’s run again on data from any other season.
How to avoid overfitting
It’s worth repeating: Too much tweaking of the model is apt to result in overfitting. One such tweak is including too many variables in the analysis. Keep those variables to a minimum. Only include variables that you see as absolutely necessary — those you believe will make a significant difference to the outcome.
This insight only comes from intimate knowledge of the business domain you’re in. That’s where the expertise of domain experts can help keep you from falling into the trap of overfitting.
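The effect of including too many variables can be illustrated with a small least-squares experiment. The sketch below rests on assumptions made purely for illustration: the data is synthetic, exactly one variable truly drives the outcome, and up to 15 irrelevant variables are added alongside it. The function names make_data and design are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n_train, n_test, n_noise = 20, 200, 15

def make_data(n):
    """Synthetic sample: one variable that matters, n_noise that do not."""
    signal = rng.normal(size=(n, 1))
    noise_vars = rng.normal(size=(n, n_noise))
    y = 4.0 * signal[:, 0] + rng.normal(scale=1.0, size=n)
    return signal, noise_vars, y

def design(signal, noise_vars, k):
    """Design matrix: intercept, the real variable, and the first k irrelevant ones."""
    return np.column_stack([np.ones(len(signal)), signal, noise_vars[:, :k]])

s_tr, z_tr, y_tr = make_data(n_train)
s_te, z_te, y_te = make_data(n_test)

results = {}
for k in (0, 5, 15):
    X_tr, X_te = design(s_tr, z_tr, k), design(s_te, z_te, k)
    coef, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)  # ordinary least squares
    train_mse = float(np.mean((X_tr @ coef - y_tr) ** 2))
    test_mse = float(np.mean((X_te @ coef - y_te) ** 2))
    results[k] = (train_mse, test_mse)
    print(f"{k:2d} extra variables: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

Each added variable can only lower (or leave unchanged) the error on the training sample, even when the variable is pure noise, which is why a falling training error alone says little about whether a variable belongs in the model.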
Here’s a checklist of best practices to help you avoid overfitting your model:
Choose a dataset to work with that is representative of the population as a whole.
Divide your dataset into two parts: training data and test data.
Keep the variables analyzed to a healthy minimum for the task at hand.
Enlist the help of domain knowledge experts.
In the stock market, for example, a classic analytical technique is back-testing — running a model against historical data to look for the best trading strategy.
Suppose that, after running his new model against data generated by a recent bull market, and tweaking the number of variables used in his analysis, the analyst creates what looks like an optimal trading strategy: one that would yield the highest returns if he could go back and trade only during the year that produced the sample data. Unfortunately, he can’t.
If he tries to apply that model in a current bear market, look out below: He’ll incur losses by applying a model too optimized for a narrow period of time and set of conditions that don’t fit current realities. (So much for hypothetical profits.)
The model worked only for that vanished bull market because it was overtrained, bearing the earmarks of the context that produced the sample data — complete with its specifics, outliers, and shortcomings. All the circumstances surrounding that dataset probably won’t be repeated in the future, or in a true representation of the whole population — but they all showed up in the overfitted model.
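The bull-market scenario can be mimicked with a toy example. The sketch below is not a trading strategy: as a stand-in, it fits a simple trend line to one invented year of rising prices and then applies it to an invented year of falling prices. All numbers (price levels, slopes, noise, 250 trading days per year) are assumptions made up for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

# Synthetic prices, invented for illustration: one rising year, then one falling year.
days_bull = np.arange(0, 250, dtype=float)
days_bear = np.arange(250, 500, dtype=float)
bull = 100.0 + 0.20 * days_bull + rng.normal(scale=2.0, size=days_bull.size)
bear = 150.0 - 0.15 * (days_bear - 250.0) + rng.normal(scale=2.0, size=days_bear.size)

# "Back-test": fit a trend line using the bull-market year only.
model = Polynomial.fit(days_bull, bull, deg=1)

def rmse(predicted, actual):
    """Root mean squared error between predictions and observed prices."""
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

bull_error = rmse(model(days_bull), bull)  # error on the period the model was tuned to
bear_error = rmse(model(days_bear), bear)  # error once market conditions change
print(f"error on bull-market year: {bull_error:.2f}")
print(f"error on bear-market year: {bear_error:.2f}")
```

The model looks accurate on the year it was fitted to and badly wrong on the following year, because it has learned the conditions of one narrow period rather than anything that carries over.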
If a model fits your sample data almost perfectly, consider that a hint to take a closer look. Enlist the help of domain knowledge experts to see whether your results really are too good to be true, and run that model on more test data for further comparisons.