Tips for Dealing with Big Data in Python

By John Paul Mueller, Luca Massaron

Using Python to deal with real data is sometimes a little trickier than the examples you read about. Real data, apart from being messy, can also be quite big: sometimes so big that it can’t fit in memory, no matter what the memory specifications of your machine are.

Determining when there is too much data

In a data science project, data can be deemed big when one of these two situations occurs:

  • It can’t fit in the available computer memory (see the quick check after this list).

  • Even if the system has enough memory to hold the data, the application can’t process the data using machine-learning algorithms in a reasonable amount of time.
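As a rough, practical check (this snippet is only a sketch; the file name and the 50-percent threshold are illustrative assumptions, not part of the original example), you can compare the dataset’s in-memory footprint with the RAM your system still has available, for instance by using pandas together with the psutil package:

import psutil
import pandas as pd

df = pd.read_csv('your_data.csv')   # hypothetical file; any loaded DataFrame works
data_bytes = df.memory_usage(deep=True).sum()
free_bytes = psutil.virtual_memory().available

print('Data uses %.1f MB; %.1f MB of RAM is free' % (
      data_bytes / 1e6, free_bytes / 1e6))
if data_bytes > 0.5 * free_bytes:   # leave headroom for the learning algorithm
    print('Consider an out-of-core approach such as SGD with partial_fit')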

Implementing Stochastic Gradient Descent

When you have too much data, you can use the Stochastic Gradient Descent Regressor (SGDRegressor) or Stochastic Gradient Descent Classifier (SGDClassifier) as a linear predictor. The main difference from most other methods is that they optimize their coefficients using only one observation at a time. It therefore takes more iterations before the code reaches results comparable to those of a ridge or lasso regression, but it requires much less memory and time.

This is because both predictors rely on Stochastic Gradient Descent (SGD) optimization, a kind of optimization in which the parameter adjustment occurs after the input of every observation, leading to a longer and slightly more erratic journey toward minimizing the error function. Of course, optimizing based on single observations, rather than on huge data matrices, can have a tremendously beneficial impact on the algorithm’s training time and memory requirements.
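To make the idea of updating after every single observation concrete, here is a minimal NumPy sketch of SGD for a simple linear regression; the learning rate, number of epochs, and toy data are illustrative choices, not scikit-learn’s actual implementation:

import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.05, epochs=50):
    # Coefficients start at zero; w[0] acts as the intercept.
    w = np.zeros(X.shape[1] + 1)
    for epoch in range(epochs):
        for xi, yi in zip(X, y):                  # one observation at a time
            error = (w[0] + np.dot(xi, w[1:])) - yi
            # The parameters change immediately after this single observation.
            w[0] -= learning_rate * error
            w[1:] -= learning_rate * error * xi
    return w

# Toy data: y = 2*x + 1 plus a little noise.
rng = np.random.RandomState(0)
toy_X = rng.rand(100, 1)
toy_y = 2 * toy_X.ravel() + 1 + rng.normal(scale=0.05, size=100)
print(sgd_linear_regression(toy_X, toy_y))        # roughly [1.0, 2.0]

Scikit-learn’s SGDRegressor and SGDClassifier follow the same per-observation logic but add the regularization options, learning-rate schedules, and loss functions discussed next.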

When using the SGDs, apart from testing different cost functions for their performance, you can also try L1, L2, and Elasticnet regularization just by setting the penalty parameter and the corresponding controlling alpha and l1_ratio parameters. Some of the cost functions are more resistant to outliers, such as modified_huber for classification or huber for regression.
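For example (the parameter values here are only illustrative settings to experiment with, not recommendations), you set these options directly in the estimator’s constructor:

from sklearn.linear_model import SGDClassifier, SGDRegressor

# penalty selects the regularization type, alpha its strength, and
# l1_ratio the L1/L2 mix used when penalty='elasticnet'.
clf = SGDClassifier(loss='modified_huber', penalty='elasticnet',
                    alpha=0.0001, l1_ratio=0.15)

# The huber loss makes the regressor more resistant to outliers.
reg = SGDRegressor(loss='huber', penalty='l2', alpha=0.0001)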

SGD is sensitive to the scale of variables, and not just because of regularization: it’s because of the way the algorithm works internally. Consequently, you must always standardize your features (for instance, by using StandardScaler) or force them into the range [0,+1] or [-1,+1]. Failing to do so leads to poor results.
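As a quick sketch (assuming X is your feature matrix), either transformation takes a single line with scikit-learn’s preprocessing tools:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Zero mean and unit variance for every feature . . .
standardized_X = StandardScaler().fit_transform(X)

# . . . or a fixed range such as [-1, +1].
bounded_X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)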

When using SGDs, you’ll always have to deal with chunks of data unless you can fit all the training data into memory. To make the training effective, you should standardize by having the StandardScaler infer the mean and standard deviation from the first available chunk of data. The mean and standard deviation of the entire dataset will most likely differ, but the transformation based on an initial estimate suffices to develop a working learning procedure.

from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import cross_val_score
import numpy as np

# polyX, y, and crossvalidation come from the earlier examples in the chapter.
# n_iter=2000 makes SGD repeat 2,000 passes over the training data.
SGD = SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.0001,
                   l1_ratio=0.15, n_iter=2000)
scaling = StandardScaler()
scaling.fit(polyX)
scaled_X = scaling.transform(polyX)
print 'CV MSE: %.3f' % abs(np.mean(cross_val_score(SGD, scaled_X, y,
    scoring='mean_squared_error', cv=crossvalidation, n_jobs=1)))

CV MSE: 12.802

In the preceding example, you used the fit method, which requires that you preload all the training data into memory. You can instead train the model in successive steps by using the partial_fit method, which runs a single iteration on the provided data, keeps the model in memory, and adjusts it when new data arrives.

from sklearn.metrics import mean_squared_error
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(scaled_X, y,
    test_size=0.20, random_state=2)
SGD = SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.0001,
                   l1_ratio=0.15)
improvements = list()
for z in range(1000):
    SGD.partial_fit(X_train, y_train)
    improvements.append(mean_squared_error(y_test, SGD.predict(X_test)))

Having kept track of the algorithm’s partial improvements during 1,000 iterations over the same data, you can produce a graph showing how the error improves, as shown in the following code. It’s important to note that you could have used different data at each step.

import matplotlib.pyplot as plt
plt.subplot(1,2,1)
plt.plot(range(1,11), np.abs(improvements[:10]), 'o--')
plt.xlabel('Partial fit initial iterations')
plt.ylabel('Test set mean squared error')
plt.subplot(1,2,2)
plt.plot(range(100,1000,100), np.abs(improvements[100:1000:100]), 'o--')
plt.xlabel('Partial fit ending iterations')
plt.show()

The algorithm initially starts with a high error rate, but it manages to reduce it within just a few iterations, usually about five. After that, the error rate improves by a smaller amount with each iteration. After about 700 iterations, the error rate reaches a minimum and starts increasing again. At that point, you’re starting to overfit: the model has already captured the rules in the data, and forcing SGD to keep learning when nothing is left but noise means it starts learning noisy, erratic rules.

A slow descent optimizing squared error.
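Because the improvements list stores the test-set error measured at every iteration, a quick check (a minimal sketch that reuses the variables from the previous code) pinpoints the turning point described above:

import numpy as np

# The iteration with the lowest test error marks where further passes over
# the same data begin to overfit.
best_iteration = int(np.argmin(improvements)) + 1
print('Lowest test MSE %.3f at iteration %d' % (min(improvements),
      best_iteration))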

Unless you’re working with all the data in memory, grid-searching and cross-validating to find the best number of iterations is difficult. A good trick is to set aside a chunk of training data for validation, either in memory or in storage. By checking your performance on that untouched part, you can see when SGD’s learning performance starts decreasing. At that point, you interrupt the data iteration (a technique known as early stopping).
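Here is a minimal sketch of that early-stopping idea combined with chunked training. The data_chunks iterable, the held-out X_valid and y_valid arrays, and the patience value are assumptions for illustration; for instance, the chunks could come from pandas.read_csv with a chunksize argument, scaled with the StandardScaler fitted on the first chunk as described earlier:

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# X_valid, y_valid: the untouched validation chunk set aside beforehand.
# data_chunks: any iterable yielding already-scaled (X_chunk, y_chunk) pairs.
SGD = SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.0001)
best_error, stalls, patience = np.inf, 0, 5

for X_chunk, y_chunk in data_chunks:
    SGD.partial_fit(X_chunk, y_chunk)
    error = mean_squared_error(y_valid, SGD.predict(X_valid))
    if error < best_error:
        best_error, stalls = error, 0   # still improving on unseen data
    else:
        stalls += 1
    if stalls >= patience:              # performance stopped improving
        break                           # early stopping

The patience counter simply tolerates a few non-improving chunks before stopping, so a single noisy validation score doesn’t end the training prematurely.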