Running Python in Parallel for Data Science

By John Paul Mueller, Luca Massaron

Most computers today are multicore (two or more processors in a single package), some with multiple physical CPUs. One of the most important limitations of Python is that it uses a single core by default. (It was created in a time when single cores were the norm.)

Data science projects require quite a lot of computations. In particular, part of the scientific aspect of data science relies on repeated tests and experiments on different data matrices. Don’t forget that working with huge quantities of data means that most time-consuming transformations repeat observation after observation (for example, identical but independent operations on different parts of a matrix).

Using more CPU cores accelerates a computation by a factor that almost matches the number of cores. For example, having four cores would mean working at best four times faster. You don’t receive a full fourfold increase because there is overhead in starting parallel processes: new Python instances have to be set up with the right in-memory information and launched; consequently, the improvement is less than the theoretical maximum but still significant.
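To see why, consider some illustrative numbers (invented for the example): if a computation takes 60 seconds on a single core and setting up the parallel workers costs about 2 seconds, four cores complete the job in roughly 60/4 + 2 = 17 seconds, a speedup of about 3.5 times rather than the ideal 4.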

Knowing how to use more than one CPU is therefore an advanced but incredibly useful skill for increasing the number of analyses completed, and for speeding up your operations both when setting up and when using your data products.

Multiprocessing works by replicating the same code and memory content in new Python instances (the workers), having each worker calculate part of the result, and returning the pooled results to the original console. If your original instance already occupies much of the available RAM, it won’t be possible to create new instances, and your machine may run out of memory.
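Here’s a minimal sketch of the idea using the joblib package directly (the square function is just a stand-in for any time-consuming task): two worker processes each receive a copy of the function and its inputs, and the results come back pooled in a single list.

from joblib import Parallel, delayed

def square(x):
    return x ** 2

# Each worker process gets a copy of the function and its input,
# computes independently, and the results are pooled into one list.
results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(10))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]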

Performing multicore parallelism

To perform multicore parallelism with Python, you integrate the Scikit-learn package with the joblib package for time-consuming operations, such as replicating models for validating results or for looking for the best hyper-parameters. In particular, Scikit-learn allows multiprocessing when

  • Cross-validating: Testing the results of a machine-learning hypothesis using different training and testing data

  • Grid-searching: Systematically changing the hyper-parameters of a machine-learning hypothesis and testing the consequent results

  • Multilabel prediction: Running an algorithm multiple times against multiple targets when there are many different target outcomes to predict at the same time

  • Ensemble machine-learning methods: Modeling a large host of classifiers, each one independent from the other, such as when using RandomForest-based modeling

You don’t have to do anything special to take advantage of parallel computations: you activate parallelism by setting the n_jobs parameter to a number of cores greater than 1 or by setting it to -1, which means you want to use all the available cores.
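For example, here’s a quick sketch of a parallel grid search (the parameter grid is invented for illustration): passing n_jobs=-1 makes Scikit-learn fit the candidate models on all the available cores.

from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

digits = load_digits()
# Every hyper-parameter candidate is fitted and cross-validated
# as a separate job; n_jobs=-1 spreads the jobs over all cores.
search = GridSearchCV(SVC(), {'C': [1, 10, 100]}, cv=5, n_jobs=-1)
search.fit(digits.data, digits.target)
print(search.best_params_)

If you want to know how many cores n_jobs=-1 translates to on your machine, multiprocessing.cpu_count() from the standard library reports the number.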

If you aren’t running your code from the console or from an IPython Notebook, it is extremely important that you separate your code from any package import or global variable assignment in your script by using the if __name__ == '__main__': check at the beginning of any code that executes multicore parallelism. The if statement verifies whether the program is run directly or is called by an already-running Python console, avoiding any confusion or error in the multiparallel process (such as recursively calling the parallelism).

Demonstrating multiprocessing

It’s a good idea to use IPython when you run a demonstration of how multiprocessing can really save you time during data science projects. Using IPython provides the advantage of using the %timeit magic command for timing execution. You start by loading a multiclass dataset, a complex machine-learning algorithm (the Support Vector Classifier, or SVC), and a cross-validation procedure for estimating reliable resulting scores from all the procedures.

The most important thing to know is that the procedure becomes quite demanding: with cv=20, cross-validation fits and tests the SVC 20 times, once for each fold, and each fit is itself costly because the SVC handles the ten digit classes by internally training a binary classifier for every pair of classes.

from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Load the digits dataset (ten handwritten-digit classes).
digits = load_digits()
X, y = digits.data, digits.target

# Time the 20-fold cross-validation on a single core.
%timeit single_core_learning = cross_val_score(SVC(), X, y, cv=20, n_jobs=1)
Out [1]: 1 loops, best of 3: 17.9 s per loop

After this test, you need to activate the multicore parallelism and time the results using the following commands:

%timeit multi_core_learning = cross_val_score(SVC(), X, y, cv=20, n_jobs=-1)
Out [2]: 1 loops, best of 3: 11.7 s per loop

The example machine demonstrates a clear advantage from multicore processing, despite using a small dataset on which Python spends most of its time starting consoles and running a part of the code in each one. This overhead of a few seconds is significant when the total execution lasts only a handful of seconds. Just imagine what would happen with larger sets of data: your execution time could easily be cut to a half or a third.

Although the code works fine with IPython, putting it into a script and asking Python to run it in a console or using an IDE may cause errors because of the internal operations of a multicore task. The solution is to put all the code under an if statement, which checks whether the program was run directly and not called by an already-running console. Here’s an example script:

from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

if __name__ == '__main__':
    digits = load_digits()
    X, y = digits.data, digits.target
    multi_core_learning = cross_val_score(SVC(), X, y,
        cv=20, n_jobs=-1)