Neural Networks and Deep Learning: Neural Network Differentiation

By John Paul Mueller, Luca Mueller

Once you know how neural networks basically work, you need a better understanding of what differentiates them to understand their role in deep learning. Beyond the different neural network architectures, the choice of activation functions, optimizers, and the neural network’s learning rate can make the difference. Knowing the basic operations isn’t enough on its own to get the results you want. Looking under the hood of a neural network helps you understand how you can tune your solution to model specific problems. In addition, understanding the various algorithms used to create a neural network will help you obtain better results with less effort and in a shorter time. The following article focuses on three areas of neural network differentiation.

Choosing the right activation function for your neural network

An activation function is the part of a neural network that simply defines when a neuron fires. Consider it a sort of tipping point: Input of a certain value won’t cause the neuron to fire because it’s not enough, but just a little more input can cause the neuron to fire. A neuron is defined in a simple manner as follows:

y = ∑ (weight * input) + bias

The output, y, can be any value between + infinity and – infinity. The problem, then, is to decide on what value of y is the firing value, which is where an activation function comes into play in your neural network. The activation function determines which value is high or low enough to reflect a decision point in the neural network for a particular neuron or group of neurons.
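The weighted-sum formula above can be sketched in a few lines of code. The inputs, weights, and bias below are arbitrary illustrative values, not from any particular network:

```python
import numpy as np

# A single neuron: the weighted sum of its inputs plus a bias term.
# These values are made up purely for illustration.
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.8, 0.1, -0.4])
bias = 0.25

# y = sum(weight * input) + bias
y = np.dot(weights, inputs) + bias
print(y)  # -0.67 (before any activation function is applied)
```

The raw output here could be any real number; the activation function's job is to map it into a decision.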

As with everything else in neural networks, you don’t have just one activation function. You use the activation function that works best in a particular scenario. With this in mind, you can break the activation functions into these categories:

  • Step: A step function (also called a binary function) relies on a specific threshold for making the decision about activating or not. Using a step function means that you know which specific value will cause an activation. However, step functions are limited in that they’re either fully activated or fully deactivated — no shades of gray exist. Consequently, when attempting to determine which class is most likely correct based on a given input, a step function won’t work.
  • Linear: A linear function (A = cx) provides a straight-line determination of activation based on input. Using a linear function helps you determine which output to activate based on which output is most correct (as expressed by weighting). However, linear functions work only as a single layer. If you were to stack multiple linear function layers, the output would be the same as using a single layer, which defeats the purpose of using neural networks. Consequently, a linear function may appear as a single layer, but never as multiple layers.
  • Sigmoid: A sigmoid function (A = 1 / (1 + e^-x)), which produces a curve shaped like the letter S, is nonlinear. It begins by looking sort of like the step function, except that the values between two points actually exist on a curve, which means that you can stack sigmoid functions to perform classification with multiple outputs. The range of a sigmoid function is between 0 and 1, not – infinity to + infinity as with a linear function, so the activations are bound within a specific range. However, the sigmoid function suffers from a problem called vanishing gradient, which means that the function refuses to learn after a certain point because the propagated error shrinks to zero as it approaches far away layers.
  • Tanh: A tanh function (A = (2 / (1 + e^-2x)) – 1) is actually a scaled sigmoid function. It has a range of –1 to 1, so again, it’s a precise method for activating neurons. The big difference between sigmoid functions and tanh functions is that the tanh function gradient is stronger, which means that detecting small differences is easier, making classification more sensitive. Like the sigmoid function, tanh suffers from vanishing gradient issues.
  • ReLU: A ReLU, or Rectified Linear Unit, function (A(x) = max(0, x)) provides an output in the range of 0 to infinity, so it’s similar to the linear function except that it’s also nonlinear, enabling you to stack ReLU functions. An advantage of ReLU is that it requires less processing power because fewer neurons fire. The lack of activity as the neuron approaches the 0 part of the line means that there are fewer potential outputs to look at. However, this advantage can also become a disadvantage when you have a problem called the dying ReLU. After a while, the neural network weights don’t provide the desired effect any longer (it simply stops learning) and the affected neurons die — they don’t respond to any input.
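The four activation functions described above translate directly into code. Here is a minimal sketch using the formulas from the list (the sample inputs are arbitrary):

```python
import numpy as np

def step(x, threshold=0.0):
    # Fires fully (1) at or above the threshold, otherwise not at all (0).
    return np.where(x >= threshold, 1.0, 0.0)

def sigmoid(x):
    # A = 1 / (1 + e^-x); output bounded to the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # A = (2 / (1 + e^-2x)) - 1; a scaled sigmoid with range (-1, 1).
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def relu(x):
    # A = max(0, x); zero for negative inputs, identity for positive ones.
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(step(x))  # [0. 1. 1.]
print(relu(x))  # [0. 0. 2.]
```

Notice how the step function collapses everything to 0 or 1, while sigmoid and tanh preserve "shades of gray" between the extremes.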

Also, the ReLU has some variants that you should consider:

  • ELU (Exponential Linear Unit): Differs from ReLU when the inputs are negative. In this case, the outputs don’t go to zero but instead slowly decrease to –1 exponentially.
  • PReLU (Parametric Rectified Linear Unit): Differs from ReLU when the inputs are negative. In this case, the output is a linear function whose parameters are learned using the same technique as any other parameters of the network.
  • LeakyReLU: Similar to PReLU but the parameter for the linear side is fixed.
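The ELU and LeakyReLU variants differ from plain ReLU only on the negative side, as this sketch shows (the alpha and slope values are common defaults, chosen here for illustration):

```python
import numpy as np

def elu(x, alpha=1.0):
    # Exponential Linear Unit: identity for positive inputs; for negative
    # inputs, alpha * (e^x - 1) decreases exponentially toward -alpha
    # (here -1) instead of cutting off at zero.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def leaky_relu(x, slope=0.01):
    # LeakyReLU: the negative side gets a small, fixed linear slope.
    # In PReLU, this slope would be a parameter learned during training.
    return np.where(x > 0, x, slope * x)
```

Because neither variant outputs a flat zero for negative inputs, gradients keep flowing there, which helps avoid the dying ReLU problem.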

Relying on a smart optimizer for your neural network

An optimizer serves to ensure that your neural network performs quickly and correctly models whatever problem you want to solve by modifying the neural network’s biases and weights. An algorithm performs this task, but you must choose the correct algorithm to obtain the results you expect. As with all neural network scenarios, you have a number of algorithm types from which to choose:

  • Stochastic gradient descent (SGD)
  • RMSProp
  • AdaGrad
  • AdaDelta
  • AMSGrad
  • Adam and its variants, Adamax and Nadam

An optimizer works by minimizing or maximizing the output of an objective function (also known as an error function) represented as E(x). This function is dependent on the model’s internal learnable parameters used to calculate the target values (Y) from the predictors (X). Two internal learnable parameters are weights (W) and bias (b). The various algorithms have different methods of dealing with the objective function.
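To make the objective function E(x) concrete, here is a minimal sketch using mean squared error for a one-weight linear model, with W and b as the learnable parameters. The toy data below is fabricated for illustration (it's generated by W = 2, b = 1):

```python
import numpy as np

# Toy predictors (X) and targets (Y); Y was produced with W=2, b=1.
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([3.0, 5.0, 7.0, 9.0])

def objective(W, b):
    # E: mean squared error between predictions (W*X + b) and targets Y.
    predictions = W * X + b
    return np.mean((predictions - Y) ** 2)

print(objective(2.0, 1.0))  # 0.0 -- the true parameters give zero error
print(objective(1.0, 0.0))  # 13.5 -- wrong parameters give a large error
```

An optimizer's job is to adjust W and b so that this error shrinks toward its minimum.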

You can categorize the optimizer functions by the manner in which they deal with the derivative (dy/dx), which is the instantaneous change of y with respect to x. Here are the two levels of derivative handling:

  • First order: These algorithms minimize or maximize the objective function using gradient values with respect to the parameters.
  • Second order: These algorithms minimize or maximize the objective function using the second-order derivative values with respect to the parameters. The second-order derivative can give a hint as to whether the first-order derivative is increasing or decreasing, which provides information about the curvature of the line.

You commonly use first-order optimization techniques, such as gradient descent, in neural networks because they require fewer computations and tend to converge to a good solution relatively fast when working on large datasets.
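A first-order method like gradient descent can be sketched in a few lines: compute the analytic first derivatives of the error with respect to W and b, then step the parameters against the gradient. The toy data and learning rate are illustrative choices:

```python
import numpy as np

# Toy data generated by W=2, b=1; gradient descent should recover them.
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([3.0, 5.0, 7.0, 9.0])

W, b = 0.0, 0.0
learning_rate = 0.05

for _ in range(2000):
    error = (W * X + b) - Y
    grad_W = 2.0 * np.mean(error * X)   # dE/dW for mean squared error
    grad_b = 2.0 * np.mean(error)       # dE/db for mean squared error
    W -= learning_rate * grad_W          # step against the gradient
    b -= learning_rate * grad_b

print(round(W, 3), round(b, 3))  # converges near W=2, b=1
```

SGD and its relatives (RMSProp, Adam, and so on) all refine this same first-order update, mainly by adapting the step size per parameter.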

Setting a working learning rate in your neural network

Each optimizer has completely different parameters to tune in your neural network. One constant among them is the learning rate (often denoted alpha), which controls how much the code updates the network’s weights at each step. The learning rate can affect both the time the neural network takes to learn a good solution (the number of epochs) and the result. In fact, if the learning rate is too low, your network will take forever to learn. Setting the value too high causes instability when updating the weights, and the network won’t ever converge to a good solution.

Choosing a learning rate that works for training your neural network is daunting because you can effectively try values in the range from 0.000001 to 100. The best value varies from optimizer to optimizer, and the value you choose depends on what type of data you have. Theory can be of little help here; you have to test different combinations before finding the most suitable learning rate for training your neural network successfully.
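The empirical search the text describes can be sketched as a simple sweep: train the same toy linear model with several candidate learning rates and compare the final error. The rates and data below are illustrative choices, not recommendations:

```python
import numpy as np

# Toy data generated by W=2, b=1.
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([3.0, 5.0, 7.0, 9.0])

def train(learning_rate, epochs=500):
    # Plain gradient descent on mean squared error; returns final error.
    W, b = 0.0, 0.0
    for _ in range(epochs):
        error = (W * X + b) - Y
        W -= learning_rate * 2.0 * np.mean(error * X)
        b -= learning_rate * 2.0 * np.mean(error)
    return np.mean(((W * X + b) - Y) ** 2)

for lr in (0.0001, 0.01, 0.05, 0.2):
    # Too small: barely learns in 500 epochs. Too large: the updates
    # oscillate and the error blows up instead of converging.
    print(lr, train(lr))
```

Even on this tiny problem, the smallest rate leaves the error high after 500 epochs while the largest makes it diverge, which is exactly the trade-off you face at scale.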

In spite of all the math surrounding them, tuning neural networks and getting them to work well is mostly a matter of empirical effort in trying different combinations of architectures and parameters.

Take the time to evaluate the learning rate and set it appropriately to ensure your neural network functions optimally.