Before you can feed the Support Vector Machine (SVM) classifier with the data that was loaded for predictive analytics, you must split the full dataset into a training set and test set.

Fortunately, scikit-learn has implemented a function that will help you to easily split the full dataset. The train_test_split function takes as input the dataset (here, the array of features and the matching array of labels) and a test_size value, given as a fraction between 0 and 1. The test_size value determines the size of the test set. For each array you pass in, the function returns two pieces, in this order: the training portion (which uses the remaining data) and the test portion (with its size specified).

Typically, one can take around 70-80 percent of the data to use as a training set and use the remaining data as the test set. But the Iris dataset is very small (only 150 instances), so you can take 90 percent of it to train the model and use the other 10 percent as test data to see how your predictive model will perform.

Type in the following code to split your dataset:

>>> from sklearn import cross_validation
>>> X_train, X_test, y_train, y_test =   cross_validation.train_test_split(,, test_size=0.10, random_state=111)

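The first line imports scikit-learn's cross_validation module into your session. (In scikit-learn 0.18 and later, the same function lives in the sklearn.model_selection module, and the cross_validation module was removed in version 0.20.) The second line sets aside 10 percent of the samples as the test set; the random_state value fixes the random seed so that the split is reproducible.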

X_train will contain 135 observations and their features.
y_train will contain 135 labels in the same order as the 135 observations.
X_test will contain 15 (or 10 percent) observations and their features.
y_test will contain 15 labels in the same order as the 15 observations.

The following code verifies that the split is what you expected:

>>> X_train.shape
(135, 4)
>>> y_train.shape
(135,)
>>> X_test.shape
(15, 4)
>>> y_test.shape
(15,)

You can see from the output that there are 135 observations with 4 features and 135 labels in the training set. The test set has 15 observations with 4 features and 15 labels.
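
For readers on a recent scikit-learn release, the split and the shape check above can be sketched as a standalone script. This is only an illustration under stated assumptions: it assumes scikit-learn 0.18 or later (where train_test_split is imported from sklearn.model_selection), and it assumes the data was loaded with load_iris into a variable named iris, a name chosen here for illustration.

```python
# Assumption: scikit-learn 0.18+; the Iris data is loaded with load_iris.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()  # illustrative variable name, not taken from the text

# Hold out 10 percent of the 150 instances; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.10, random_state=111
)

# Feature arrays are 2-D (observations x features); label arrays are 1-D.
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

The feature arrays keep all 4 columns, while the label arrays are one-dimensional, which is why their shapes have a single entry.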

Many beginners in the field of predictive analytics forget to split the dataset, which introduces a serious design flaw into the project. If all 150 instances were used as training data, that would leave no unseen data for testing the model. Then you’d have to resort to reusing some of the training instances to test the predictive model.

You’ll see that in such a situation, the model predicts the correct class nearly every time, because you’re using the same exact data you used to train the model. The model has already seen these patterns before; it will have little trouble repeating what it’s seen. That result tells you nothing about how well the model generalizes. A working predictive model needs to make predictions for data that it hasn’t seen yet.
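
One way to see the difference is to score the same model on the data it was trained on and on the held-out data. The sketch below is a rough illustration, not a definitive setup: it assumes scikit-learn 0.18 or later, uses an SVC with default settings as a stand-in classifier, and the names iris, clf, train_accuracy, and test_accuracy are illustrative.

```python
# Assumption: scikit-learn 0.18+; SVC with defaults is a stand-in classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.10, random_state=111
)

clf = SVC()
clf.fit(X_train, y_train)

# score() returns the fraction of correctly classified observations.
train_accuracy = clf.score(X_train, y_train)  # data the model has already seen
test_accuracy = clf.score(X_test, y_test)     # data the model has never seen

print("accuracy on training data:", train_accuracy)
print("accuracy on held-out data:", test_accuracy)
```

Only the second number says anything about how the model is likely to behave on new observations.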

When you have an instance of an SVM classifier, a training dataset, and a test dataset, you’re ready to train the model with the training data. Typing the following code into the interpreter will do exactly that:

>>> svmClassifier.fit(X_train, y_train)

This line of code creates a working model to make predictions from. Specifically, it creates a predictive model that predicts which class of Iris a new, unlabeled observation belongs to. The svmClassifier instance has several methods that you can call to do various things.

For example, after calling the fit method, the most useful method to call is the predict method. That’s the method to which you’ll feed new data; in return, it predicts the outcome.
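
The fit-then-predict sequence described above might look like the following sketch. It rests on assumptions: scikit-learn 0.18 or later, and a classifier built as SVC() with default settings, since the text does not show how svmClassifier was constructed. The names iris and predicted are illustrative.

```python
# Assumption: scikit-learn 0.18+; how svmClassifier is built here is a guess.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.10, random_state=111
)

svmClassifier = SVC()                # assumed construction, default settings
svmClassifier.fit(X_train, y_train)  # learn from the 135 training observations

# predict() returns one class label per row of the array it is given.
predicted = svmClassifier.predict(X_test)

print(predicted)  # labels the model assigns to the 15 held-out observations
print(y_test)     # the true labels, in the same order, for comparison
```

Comparing the two printed arrays position by position shows which held-out observations the model classified correctly.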