10 Recommendations for Training Neural Networks

By Matthew Scarpino

In most software development efforts, an application will always do its job if you code it correctly. But when you work with neural networks, this isn’t the case. You can write flawless code and still end up with lousy results. No matter what the academics say, neural network development is not an exact science — there’s still a lot of art involved.

Here are ten recommendations that can help you improve the accuracy and performance of your neural networks. These general rules are based on experience. But keep in mind that neural networks are never completely reliable: Even a perfectly coded neural network can fail from time to time.

Select a Representative Dataset

This recommendation is the simplest because it doesn’t involve any math or software development. When it comes to training samples, more is better, but size isn’t the only priority. You need to make sure that your training dataset resembles the real world. Also, if your application classifies samples into categories, you need to make sure that you have a large number of samples for each category.

When it comes to image classification, you never know what bizarre features the neural network will focus on. For this reason, many developers add low levels of random noise to their input samples. This noise shouldn’t obfuscate the image but should force the neural network to pay attention to relevant characteristics.
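As a rough illustration of this technique, the following NumPy sketch adds low-level Gaussian noise to a batch of images whose pixel values lie in [0, 1]. The noise scale of 0.05 is an arbitrary choice for this example and should be tuned so the images remain recognizable.

```python
import numpy as np

def add_noise(images, scale=0.05, rng=None):
    """Add low-level Gaussian noise to a batch of images in [0, 1].

    `scale` (the noise standard deviation) is an illustrative knob;
    tune it so the noise never obscures the image content.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    noisy = images + rng.normal(0.0, scale, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep pixel values valid

batch = np.full((2, 4, 4), 0.5)  # two toy 4x4 "images"
noisy = add_noise(batch, scale=0.05)
```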

Standardize Your Data

When you test a machine learning application or use it for practical prediction, you should make sure that the test data statistically resembles the training data. In practice, this means transforming the test/prediction data with the same mean and standard deviation that you computed from the training data.

The process of shifting and scaling a dataset to a target mean and standard deviation is called standardization. Many applications standardize their data by subtracting the mean (setting it to 0) and dividing by the standard deviation (setting it to 1). In a TensorFlow application, you can accomplish this by calling tf.nn.moments to compute the statistics and tf.nn.batch_normalization to apply the transform.
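The transform itself is simple enough to sketch in NumPy. The snippet below uses toy data rather than the TensorFlow calls mentioned above; the key point is that the statistics come from the training set and are then applied to both sets:

```python
import numpy as np

rng = np.random.default_rng(42)
train = rng.normal(5.0, 3.0, size=(1000, 4))  # toy training features
test = rng.normal(5.0, 3.0, size=(200, 4))    # toy test features

# Compute the statistics on the *training* data only...
mean, std = train.mean(axis=0), train.std(axis=0)

# ...then apply the same transform to both sets, so the test data
# statistically resembles the training data.
train_std = (train - mean) / std
test_std = (test - mean) / std
```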

Use Proper Weight Initialization

Researchers have devised a number of mathematical procedures for initializing the weights of a neural network. One of the most popular is the Glorot method, also called the Xavier method, which scales the initial weights according to the number of a layer's inputs and outputs so that the variance of the signal stays roughly constant from layer to layer. You can use this method in your applications by calling tf.contrib.layers.xavier_initializer.
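The formula behind the method is easy to sketch. The NumPy function below implements the uniform variant of Glorot initialization, which (as I understand it) matches the default behavior of tf.contrib.layers.xavier_initializer:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Glorot/Xavier uniform initialization: draw weights from
    U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)),
    keeping signal variance roughly constant across layers.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Weights for a layer with 256 inputs and 128 outputs
W = glorot_uniform(256, 128)
```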

Start with a Small Number of Layers

For complex problems, you probably won't know how many hidden layers to create. Some developers assume that deeper is better and construct neural networks with many (more than ten) hidden layers. But added depth increases the likelihood of overfitting, in which the neural network becomes so focused on your specific training data that it fails to generalize to data it hasn't seen.

To avoid overfitting, it’s a good idea to start small. If the accuracy is unacceptable, increase the network’s depth until the accuracy reaches a suitable value. In addition to reducing the likelihood of overfitting, the start-small method trains faster, because each run involves far fewer weights than it would in a large network.

Add Dropout Layers

In addition to dense layers, I recommend that you add dropout layers to your neural networks. A dropout layer sets a randomly chosen percentage of its inputs to 0 before passing the signals as output. This reduces the likelihood of overfitting by preventing the neurons feeding the dropout layer from becoming co-dependent on one another's outputs.

In TensorFlow, you can create a dropout layer by calling tf.nn.dropout. This function accepts a keep_prob value that identifies the probability that each input will be retained rather than discarded.
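To make the mechanics concrete, here is a minimal NumPy sketch of "inverted" dropout, which, like tf.nn.dropout, scales the surviving inputs by 1/keep_prob so that the expected signal strength is unchanged:

```python
import numpy as np

def dropout(x, keep_prob, rng=None):
    """Inverted dropout: each input survives with probability keep_prob
    and is scaled by 1/keep_prob so the expected sum is unchanged.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(x.shape) < keep_prob  # True where input survives
    return np.where(mask, x / keep_prob, 0.0)

x = np.ones((1000,))
y = dropout(x, keep_prob=0.5)  # roughly half the values become 0
```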

Train with Small, Random Batches

After you preprocess your data, initialize your weights, and determine the initial structure of your neural network, you’re ready to start training. Rather than train with the entire dataset at once, you should split your data into batches. The neural network will compute gradients and update its weights once for each batch processed.

Reducing the batch size increases the number of updates per epoch, which lengthens training time, but it also decreases the likelihood that the optimizer will settle into a local minimum instead of finding the global minimum. Small batches also reduce the dependence of the analysis on the order of the samples, and you can reduce this dependence further by shuffling the samples between epochs.
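Here is a minimal NumPy sketch of this batching scheme; the function name iterate_minibatches is my own, not a library call:

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng=None):
    """Yield small, random batches; shuffle the sample order so
    training doesn't depend on how the data happens to be ordered."""
    rng = np.random.default_rng(0) if rng is None else rng
    order = rng.permutation(len(X))  # random order over all samples
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

X = np.arange(100).reshape(100, 1).astype(float)
y = np.arange(100)
batches = list(iterate_minibatches(X, y, batch_size=32))
```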

Normalize Batch Data

Even if you standardize the samples entering your neural network, the mean and variance of your data will change as it moves from one hidden layer to the next. For this reason, developers normalize the data as it leaves each layer.

This normalization involves setting the mean to zero and the standard deviation to one. But the process is slightly more complicated than standardizing the input data, because the batch’s mean and variance only approximate those of the full dataset, and the layer also learns a scale and shift to apply after normalizing. Rather than do the math yourself, call tf.contrib.layers.batch_norm.
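If you're curious what the batch norm layer computes, here is a simplified NumPy sketch of the training-time calculation. It omits the running averages that the TensorFlow layer tracks for use at inference time:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch dimension, then apply a
    learnable scale (gamma) and shift (beta). The batch mean and
    variance approximate those of the full dataset; eps guards
    against division by zero.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Toy hidden-layer activations: 64 samples, 8 features
activations = np.random.default_rng(1).normal(3.0, 2.0, size=(64, 8))
normed = batch_norm(activations)
```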

Try Different Optimization Algorithms

Your choice of optimizer will play a critical role in determining the accuracy and performance of your application. Which optimization method is best? Despite decades of analysis, researchers haven’t reached a consensus.

Start with the Adam and Adagrad optimizers, but if you’re not getting the performance and accuracy you want, it’s a good idea to try other methods. In a TensorFlow application, you set the optimization method by creating an instance of an optimizer class, such as tf.train.AdamOptimizer, calling its minimize method, and running the returned operation in a session.
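To show what Adam actually computes, here is a NumPy sketch of the Adam update rule, applied to a one-dimensional quadratic. The hyperparameter defaults match the values proposed in the original Adam paper; the learning rate of 0.1 is an arbitrary choice for this toy problem.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m)
    and its square (v), with bias correction for the early steps."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)          # bias-corrected first moment
    v_hat = v / (1 - beta2**t)          # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t)
```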

Set the Right Learning Rate

An optimizer’s learning rate determines how much the optimizer adjusts the weights with each training step. If you set the learning rate too high, the optimizer will make dramatic changes to the weights, and training may oscillate or diverge instead of converging to a solution. If you set the learning rate too low, the optimizer will proceed slowly, and it may settle into a local minimum instead of finding the global minimum.

Typical learning rates vary from 0.0001 to 0.5, but the best learning rate varies from application to application. I recommend starting with a high value and repeatedly reducing the learning rate until you’re satisfied with the application’s accuracy and performance.
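A tiny experiment makes the trade-off visible. The sketch below runs plain gradient descent on f(w) = w², whose gradient is 2w; the three learning rates are illustrative, not recommendations:

```python
import numpy as np

def gradient_descent(lr, steps=50, w0=10.0):
    """Minimize f(w) = w^2 (gradient 2w) with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

too_high = gradient_descent(lr=1.1)    # overshoots: |w| grows each step
too_low = gradient_descent(lr=0.001)   # crawls: barely moves toward 0
good = gradient_descent(lr=0.1)        # converges quickly toward 0
```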

Check Weights and Gradients

Machine learning applications frequently fail because the gradients shrink toward zero (the vanishing gradient problem) or grow without bound (the exploding gradient problem). In both cases, you may need to adjust the number of layers in your network and/or the activation function of each layer.
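One simple diagnostic is to compute the L2 norm of each layer's gradient and watch how the norms change across layers. The NumPy sketch below fabricates toy gradients whose magnitudes shrink with depth, mimicking the vanishing case; in a real application you would pass in the gradients your framework computes.

```python
import numpy as np

def gradient_norms(grads):
    """Return the L2 norm of each layer's gradient. Norms that shrink
    toward zero across layers suggest vanishing gradients; norms that
    blow up suggest exploding gradients."""
    return [float(np.linalg.norm(g)) for g in grads]

# Toy gradients for a 3-layer network: earlier layers receive products
# of later-layer factors, so small factors compound with depth.
rng = np.random.default_rng(0)
grads = [rng.normal(0, 0.1**depth, size=(8, 8)) for depth in (1, 2, 3)]
norms = gradient_norms(grads)  # shrinks layer by layer
```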

Thankfully, TensorFlow lets you save a layer’s weights and visualize the weights with TensorBoard.