Once you have all the tools and data necessary to start creating a predictive model, the fun begins. In general, creating a learning model for classification tasks will entail the following steps:

  1. Load the data.

  2. Choose a classifier.

  3. Train the model.

  4. Visualize the model.

  5. Test the model.

  6. Evaluate the model.

Both the logistic regression and Support Vector Machine (SVM) classification models perform rather well using the Iris dataset.

Sepal Length Sepal Width Petal Length Petal Width Target Class/Label
5.1 3.5 1.4 0.2 Setosa (0)
7.0 3.2 4.7 1.4 Versicolor (1)
6.3 3.3 6.0 2.5 Virginica (2)

The logistic regression model with parameter C=1 was perfect in its predictions, while the SVM model and the logistic regression model with C=150 missed only one prediction. Indeed, the high accuracy of both models is a result of having a small dataset that has data points that are pretty close to linearly separable.

Interestingly, the logistic regression model with C=150 had a better-looking decision surface plot than the one with C=1, but it didn’t perform better. That’s not such a big deal, considering that the test set is so small. If another random split between training set and test set had been selected, the results could have easily been different.

This reveals another source of complexity that crops up in model evaluation: the effect of sampling, and how choosing the training and testing sets can affect the model’s output. Cross-validation techniques can help minimize the impact of random sampling on the model’s performance.

For a larger dataset with non-linearly separable data, you would expect the results to deviate even more. In addition, choosing the appropriate model becomes increasingly difficult due to the complexity and size of the data. Be prepared to spend a great deal of time tuning your parameters to get an ideal fit.

When creating predictive models, try a few algorithms and exhaustively tune their parameters until you find what works best for your data. Then compare their outputs against each other.