Data Science: Creating a Stochastic Solution with Support Vector Machine in Python

By John Paul Mueller, Luca Massaron

Support Vector Machine (SVM) machine-learning algorithms are a fantastic tool for a data scientist to use with Python. Of course, even the best solutions have problems. For example, you might think that the SVM has too many parameters. Certainly, the parameters are a nuisance, especially when you have to test so many combinations of them, which can take a lot of CPU time.

However, the key problem is the time necessary for training the SVM. You may have noticed that the examples use small datasets with a limited number of variables, and performing some extensive grid searches still takes a lot of time. Real-world datasets are much bigger. Sometimes it may seem to take forever to train and optimize your SVM on your computer.

A possible solution when you have too many cases (a suggested limit is 10,000 examples) is found inside the same SVM module: the LinearSVC class. This algorithm works only with the linear kernel, and its focus is classification (sorry, no regression) of large numbers of examples and variables at a higher speed than the standard SVC. Such characteristics make LinearSVC a good candidate for text-based classification. LinearSVC has fewer and slightly different parameters to set than the usual SVM (in this respect, it resembles a regression class):

  • C: The penalty parameter. Small values imply more regularization (simpler models with coefficients that are attenuated or set to zero).

  • loss: A value of l1 (the hinge loss, just as in SVM) or l2 (the squared hinge loss, which weights errors more, so the algorithm strives harder to fit misclassified examples).

  • penalty: A value of l2 (attenuation of less important parameters) or l1 (unimportant parameters are set to zero).

  • dual: A value of True or False. It refers to the type of optimization problem solved and, though it won't change the obtained score much, setting it to False results in faster computations than setting it to True.

The loss, penalty, and dual parameters are also bound by reciprocal constraints, so refer to the following table to plan which combination to use in advance.

penalty    loss    dual
l1         l2      False
l2         l1      True
l2         l2      True; False

The algorithm doesn’t support the combination of penalty=l1 and loss=l1. However, the combination of penalty=l2 and loss=l1 perfectly replicates the SVC optimization approach.
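
As a quick check, the following minimal sketch (an illustrative assumption, not part of the original example; recent scikit-learn versions name the l1 and l2 losses hinge and squared_hinge) confirms that the SVC-like combination fits without complaint while the unsupported one raises an error:

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=4, random_state=101)

# penalty='l2' with the l1 (hinge) loss replicates the SVC optimization
LinearSVC(penalty='l2', loss='hinge', dual=True, random_state=101).fit(X, y)
print('penalty=l2, loss=l1: fits without complaint')

# penalty='l1' with the l1 (hinge) loss is not supported and raises an error
try:
    LinearSVC(penalty='l1', loss='hinge', dual=False).fit(X, y)
except ValueError as err:
    print('penalty=l1, loss=l1: %s' % err)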

LinearSVC is quite fast, and a speed test against SVC demonstrates the level of improvement to expect in choosing this algorithm.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

# Create an artificial classification dataset
X, y = make_classification(n_samples=10**4, n_features=15,
                           n_informative=10, random_state=101)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

slow_SVM = SVC(kernel='linear', random_state=101)
# loss='hinge' is the l1 loss from the table (older versions accepted loss='l1')
fast_SVM = LinearSVC(random_state=101, penalty='l2', loss='hinge')
slow_SVM.fit(X_train, y_train)
fast_SVM.fit(X_train, y_train)
print('SVC test accuracy score: %0.3f' % slow_SVM.score(X_test, y_test))
print('LinearSVC test accuracy score: %0.3f'
      % fast_SVM.score(X_test, y_test))
SVC test accuracy score: 0.808
LinearSVC test accuracy score: 0.808

After you create an artificial dataset using make_classification, the code confirms that the two algorithms arrive at identical results. At this point, the code tests the speed of the two solutions on the synthetic dataset in order to understand how they scale when using more data.

import timeit
import numpy as np
X, y = make_classification(n_samples=10**4, n_features=15,
                           n_informative=10, random_state=101)
# Average three timed fits for each algorithm
print('avg secs for SVC, best of 3: %0.1f' % np.mean(
    timeit.repeat("slow_SVM.fit(X, y)",
                  "from __main__ import slow_SVM, X, y", number=1, repeat=3)))
print('avg secs for LinearSVC, best of 3: %0.1f' % np.mean(
    timeit.repeat("fast_SVM.fit(X, y)",
                  "from __main__ import fast_SVM, X, y", number=1, repeat=3)))

The example system shows the following result:

avg secs for SVC, best of 3: 15.9
avg secs for LinearSVC, best of 3: 0.4

Clearly, given the same data quantity, LinearSVC is much faster than SVC: the performance ratio is 15.9 / 0.4 = 39.75, or almost 40 times faster. But what if you grow the sample size from 10**4 to 10**5?

avg secs for SVC, best of 3: 3831.6
avg secs for LinearSVC, best of 3: 10.0

The results are quite impressive: LinearSVC is 3831.6 / 10.0 = 383.16 times faster than SVC. Yet even though LinearSVC is quite fast at performing its tasks, you may need to classify or regress examples in the range of millions, and then you need to know whether LinearSVC is still the better choice.

You previously saw how the SGD classes, SGDClassifier and SGDRegressor, help you implement an SVM-type algorithm in situations with millions of data rows without investing too much computational power. All you have to do is set their loss parameter to hinge for SGDClassifier and to epsilon_insensitive for SGDRegressor.
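
The following minimal sketch (the small dataset is an illustrative assumption) shows how you might configure the two classes so that they mimic linear SVM models:

from sklearn.linear_model import SGDClassifier, SGDRegressor
from sklearn.datasets import make_classification, make_regression

# SGDClassifier with the hinge loss behaves like a linear SVM classifier
Xc, yc = make_classification(n_samples=1000, n_features=15, random_state=101)
svm_clf = SGDClassifier(loss='hinge', random_state=101)
svm_clf.fit(Xc, yc)

# SGDRegressor with the epsilon-insensitive loss behaves like a linear SVR
Xr, yr = make_regression(n_samples=1000, n_features=15, random_state=101)
svm_reg = SGDRegressor(loss='epsilon_insensitive', random_state=101)
svm_reg.fit(Xr, yr)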

Another performance and speed test makes the advantages and limitations of using LinearSVC or SGDClassifier clear:

from sklearn.linear_model import SGDClassifier
X, y = make_classification(n_samples=10**6, n_features=15,
                           n_informative=10, random_state=101)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

The sample is quite big: 1 million cases. If you have enough memory and a lot of time, you may even want to increase the number of training cases or the number of features to test more extensively how the two algorithms scale with big data.

fast_SVM = LinearSVC(random_state=101)
fast_SVM.fit(X_train, y_train)
print('LinearSVC test accuracy score: %0.3f'
      % fast_SVM.score(X_test, y_test))
print('avg secs for LinearSVC, best of 3: %0.1f' % np.mean(
    timeit.repeat("fast_SVM.fit(X_train, y_train)",
                  "from __main__ import fast_SVM, X_train, y_train",
                  number=1, repeat=3)))
LinearSVC test accuracy score: 0.806
avg secs for LinearSVC, best of 3: 311.2

On the test computer, LinearSVC completed its computations on 1 million rows in about five minutes. SGDClassifier instead took about a second to process the same data and obtain an inferior, but comparable, score.

# max_iter is the modern name of the older n_iter parameter
stochastic_SVM = SGDClassifier(loss='hinge', max_iter=5, shuffle=True,
                               random_state=101)
stochastic_SVM.fit(X_train, y_train)
print('SGDClassifier test accuracy score: %0.3f'
      % stochastic_SVM.score(X_test, y_test))
print('avg secs for SGDClassifier, best of 3: %0.1f' % np.mean(
    timeit.repeat("stochastic_SVM.fit(X_train, y_train)",
                  "from __main__ import stochastic_SVM, X_train, y_train",
                  number=1, repeat=3)))
SGDClassifier test accuracy score: 0.799
avg secs for SGDClassifier, best of 3: 0.8

Increasing the n_iter parameter (renamed max_iter in recent scikit-learn versions) can improve the performance, but it proportionally increases the computation time. Performance improves as the number of iterations grows, up to a certain value; beyond that value, it starts to decrease because of overfitting.
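
If you want to see this trade-off on your own machine, a minimal sketch along the following lines sweeps a few iteration counts and prints the test accuracy for each (the grid of values is an arbitrary assumption; tol=None forces the algorithm to run all the requested epochs instead of stopping early):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=10**4, n_features=15,
                           n_informative=10, random_state=101)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

# Watch how the test score changes as the number of iterations grows
for iterations in [1, 5, 10, 50, 100]:
    model = SGDClassifier(loss='hinge', max_iter=iterations,
                          shuffle=True, tol=None, random_state=101)
    model.fit(X_train, y_train)
    print('max_iter=%3i test accuracy: %0.3f'
          % (iterations, model.score(X_test, y_test)))

On most datasets, you should see the score climb quickly over the first few values and then flatten out or even dip slightly, which is the overfitting effect described above.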