# Playing with Scikit-Learn and Neural Networks

Starting with the idea of reverse-engineering how a brain processes signals, researchers based neural networks on biological analogies and their components, using brain terms such as *neurons* and *axons* as names. However, you’ll discover that neural networks resemble nothing more than a sophisticated kind of linear regression because they are a summation of coefficients multiplied by numeric inputs. You also find that neurons are just where such summations happen.

Even if neural networks don’t mimic a brain (they’re arithmetic), these algorithms are extraordinarily effective against complex problems such as image and sound recognition, or machine language translation. They also execute quickly when predicting, if you use the right hardware. Well-devised neural networks use the name *deep learning* and are behind powerful tools like Siri and other digital assistants, along with more astonishing machine learning applications as well.

Running deep learning requires special hardware (a computer with a GPU) and installing special frameworks such as Tensorflow MXNet, Pytorch or Chainer. This book doesn’t delve into complex neural networks but does explore a simpler implementation offered by Scikit-learn instead, which allows you to create neural network quickly and compare them to other machine learning algorithms.

## Understanding neural networks

The core neural network algorithm is the neuron (also called a unit). Many neurons arranged in an interconnected structure make up the layers of a neural network, with each neuron linking to the inputs and outputs of other neurons. Thus, a neuron can input features from examples or from the results of other neurons, depending on its location in the neural network.

Contrary to other algorithms, which have a fixed pipeline that determines how algorithms receive and process data, neural networks require you to decide how information flows by fixing the number of units (the neurons) and their distribution in layers. For this reason, setting up neural networks is more an art than a science; you learn from experience how to arrange neurons into layers and obtain the best predictions. In a more detailed view, neurons in a neural network take many weighted values as inputs, sum them, and provide the summation as the result.

A neural network can process only numeric, continuous information; it can’t process qualitative variables (for example, labels indicating a quality such as red, blue, or green in an image). You can process qualitative variables by transforming them into a continuous numeric value, such as a series of binary values

Neurons also provide a more sophisticated transformation of the summation. In observing nature, scientists noticed that neurons receive signals but don’t always release a signal of their own. It depends on the amount of signal received. When a neuron in a brain acquires enough stimuli, it fires an answer; otherwise, it remains silent. In a similar fashion, neurons in a neural network, after receiving weighted values, sum them and use an activation function to evaluate the result, which transforms it in a nonlinear way. For instance, the activation function can release a zero value unless the input achieves a certain threshold, or it can dampen or enhance a value by nonlinearly rescaling it, thus transmitting a rescaled signal.

Each neuron in the network receives inputs from the previous layers (when starting, it connects directly with data), weights them, sums them all, and transforms the result using the activation function. After activating, the computed output becomes the input for other neurons or the prediction of the network. Consequently, given a neural network made of a certain number of neurons and layers, what makes this structure efficient in its predictions is the weights used by each neuron for its inputs. Such weights aren’t different from the coefficients of a linear regression, and the network learns their value by repeated passes (iterations or epochs) over the examples of the dataset.

## Classifying and regressing with neurons using Scikit-learn

If you plan to work with neural networks and Python, you’ll need Scikit-learn.Scikit-learn offers two functions for neural networks:

`MLPClassifier`

: Implements a multilayer perceptron (MLP) for classification. Its outputs (one or many, depending on how many classes you have to predict) are intended as probabilities of the example being of a certain class.`MLPRegressor`

: Implements MLP for regression problems. All its outputs (because it can predict multiple target values at one time) are intended as estimates of the measures to predict.

Because both functions have the exact same parameters, the Scikit-learn example delves into a single example for classification, using the handwritten digits as an example of multiclass classification using a MLP. The example starts by importing the necessary packages, loading the dataset into memory, and splitting it into a training and a test set:

from sklearn.model_selection import train_test_split from sklearn.model_selection import cross_val_score from sklearn.preprocessing import MinMaxScaler from sklearn import datasets from sklearn.neural_network import MLPClassifier digits = datasets.load_digits() X, y = digits.data, digits.target X_tr, X_t, y_tr, y_t = train_test_split(X, y, test_size=0.3, random_state=0)

Preprocessing the Scikit-learn data to feed to the neural network is an important aspect because the operations that neural networks perform under the hood are sensitive to the scale and distribution of data. Consequently, it’s good practice to normalize the data by putting its mean to zero and its variance to one, or to rescale it by fixing the minimum and maximum between –1 and +1 or 0 and +1. Experimentation shows which transformation works better for your data, though most people find that rescaling between –1 and +1 works better. This Scikit-learn example rescales all the values between –1 and +1.

scaling = MinMaxScaler(feature_range=(-1, 1)).fit(X_tr) X_tr = scaling.transform(X_tr) X_t = scaling.transform(X_t)

It’s good practice to define the preprocessing transformations on the training data alone and then apply the learned procedure to the test data. Only in this way can you correctly test how your model works with different data.

To define MLP, you must consider that there are quite a few parameters and if you don’t tune them correctly, the results may be disappointing. (MLP is not an algorithm that works out of the box.) For an MLP to work properly, you should first define the architecture of the neurons, setting how many to use for each layer and how many layers to create. (You state the number of neurons for each layer in the `hidden_layer_sizes`

parameter.) Then you have to determine the right solver among:

**L-BFGS:**Use for small datasets.**Adam:**Use for large datasets.**SGD:**Excels at most problems if you correctly set some special parameters. L-BFGS works for small datasets, Adam for large ones, and SDG can excel at most problems if you set its parameters correctly. SGD’s parameters are the learning rate, which can reflect learning speed, and momentum (or Nesterov’s momentum), a value that helps the neural network to avoid less useful solutions. When specifying the learning rate, you have to define its starting value (`learning_rate_init`

, which is usually around 0.001, but it can be even less) and how the speed changes during training (the`learning_rate`

parameter, which can be`'constant'`

,`'invscaling'`

, or`'adaptive'`

).

Given the complexity of setting the parameters for an SGD solver, you can determine how they work on your data only by testing them in a hyperparameter optimization. Most people prefer to start with an L-BFGS or Adam solver.

Another critical hyperparameter is `max_iter`

, the number of iterations, which can lead to completely different results if you set it too low or too high. The default is 200 iterations, but it’s always better, after having fixed the other parameters, to try to increase or decrease its number. Finally, shuffling the data (`shuffle=True`

) and setting a `random_state`

for reproducibility of results are also important. The example code sets 512 nodes on a single layer, relies on the Adam solver, and uses the standard number of iterations (200):

nn = MLPClassifier(hidden_layer_sizes=(512, ), activation='relu', solver='adam', shuffle=True, tol=1e-4, random_state=1) cv = cross_val_score(nn, X_tr, y_tr, cv=10) test_score = nn.fit(X_tr, y_tr).score(X_t, y_t) print('CV accuracy score: %0.3f' % np.mean(cv)) print('Test accuracy score: %0.3f' % (test_score))

Using this code, the example successfully classifies handwritten digits by running an MLP whose CV and test score are

CV accuracy score: 0.978 Test accuracy score: 0.981

The results obtained are a little better than SVC’s, yet the increase involves tuning quite a few parameters correctly as well. When using nonlinear algorithms, you can’t expect any no-brainer approach, apart from a few decision-tree based solutions.