Struggling with Overfitting in Machine Learning

By Nikhil Abraham

Given the neural network architecture, you can imagine how easily the algorithm could learn almost anything from data, especially if you added too many layers. In fact, the algorithm fits the training data so well that its predictions are often affected by high variance in the estimates, a problem called overfitting. Overfitting causes the neural network to learn every detail of the training examples, noise included, so it can replicate them in the prediction phase. Outside the training set, however, its predictions become unreliable.
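The effect is easy to reproduce even without a neural network. The following is a minimal sketch in base R, assuming synthetic data (a noisy sine curve) and using polynomial regression as a stand-in for a flexible model; the degrees 3 and 15 are arbitrary choices for illustration.

```r
# Synthetic data (an assumption for illustration): a noisy sine curve.
set.seed(1)
x <- runif(60, 0, 1)
y <- sin(2 * pi * x) + rnorm(60, sd = 0.3)
train <- data.frame(x = x[1:30], y = y[1:30])
test  <- data.frame(x = x[31:60], y = y[31:60])

# Root mean squared error of a fitted model on a given data set.
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

# A modest model (degree 3) versus a very flexible one (degree 15).
simple   <- lm(y ~ poly(x, 3),  data = train)
flexible <- lm(y ~ poly(x, 15), data = train)

results <- data.frame(
  degree     = c(3, 15),
  train_rmse = c(rmse(simple, train), rmse(flexible, train)),
  test_rmse  = c(rmse(simple, test),  rmse(flexible, test))
)
print(results)
```

The flexible model always fits the training set at least as well as the simple one, because it contains the simple model as a special case. On the held-out points it typically does worse, which is the signature of overfitting.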

Understanding the overfitting problem

When you use a neural network for a real problem, you have to take precautions more strictly than you do with other algorithms. Neural networks are more fragile and more prone to serious errors than other machine learning solutions.

First, you carefully split your data into training, validation, and test sets. Before the algorithm learns from data, you must choose its hyperparameters and later judge those choices on the validation set: architecture (the number of layers and the nodes in each); activation functions; learning rate; and number of iterations. In particular, the architecture offers great opportunities to create powerful predictive models, at a high risk of overfitting. The learning rate controls how fast a network learns from data, but tuning it may not suffice to prevent overfitting the training data.
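A three-way split can be sketched in base R as follows. The 60/20/20 proportions and the seed are assumptions chosen for illustration, and the built-in iris data serves only as a convenient example.

```r
set.seed(42)                 # arbitrary seed, only for reproducibility
n <- nrow(iris)
shuffled <- sample(n)        # random permutation of the row indices

# Assumed proportions: 60% training, 20% validation, 20% test.
n_train <- floor(0.6 * n)
n_valid <- floor(0.2 * n)

train_idx <- shuffled[1:n_train]
valid_idx <- shuffled[(n_train + 1):(n_train + n_valid)]
test_idx  <- shuffled[(n_train + n_valid + 1):n]

train_set <- iris[train_idx, ]   # used to fit the weights
valid_set <- iris[valid_idx, ]   # used to compare hyperparameter choices
test_set  <- iris[test_idx, ]    # used once, for the final estimate

print(sapply(list(train = train_set, valid = valid_set, test = test_set), nrow))
```

Because the three index vectors come from one permutation, no row can end up in more than one set.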

You have two possible solutions to this problem:

  • The first solution is regularization, as in linear and logistic regression. You can sum all connection coefficients, squared or in absolute value, to penalize models with too many coefficients with high values (achieved by L2 regularization) or with values different from zero (achieved by L1 regularization).
  • The second solution is early stopping, which is effective because it detects when overfitting begins. It works by checking the cost function on the validation set as the algorithm learns from the training set.
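To make the regularization penalties concrete, here is a small worked sketch. The weights, the regularization strength lambda, and the unpenalized cost are all made-up values for illustration.

```r
# Hypothetical connection weights of a small network (illustrative values).
weights <- c(0.5, -1.2, 0.0, 2.0, -0.3)
lambda  <- 0.01                            # assumed regularization strength

l1_penalty <- lambda * sum(abs(weights))   # L1: sum of absolute values
l2_penalty <- lambda * sum(weights^2)      # L2: sum of squared values

# The regularized cost adds the penalty to the data-fit cost.
data_cost <- 0.25                          # placeholder for the unpenalized cost
cost_l1 <- data_cost + l1_penalty
cost_l2 <- data_cost + l2_penalty
print(c(L1 = cost_l1, L2 = cost_l2))       # L1 = 0.29, L2 = 0.3078
```

The single large weight (2.0) contributes half of the L1 penalty but more than two thirds of the L2 penalty, which is why L2 mainly discourages large values, whereas L1 charges every nonzero weight in proportion to its size.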

You may not realize when your model starts overfitting, because the cost function calculated on the training set keeps improving as optimization progresses. However, once the network starts recording noise from the data and stops learning general rules, the cost function on an out-of-sample set (the validation set) reveals it. At some point, you’ll notice that the validation cost stops improving and starts worsening, which means that your model has reached its learning limit.
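The monitoring loop can be sketched in base R. This is only an illustration under stated assumptions: the data are synthetic, a polynomial regression trained by gradient descent stands in for the network, and the learning rate, epoch limit, and patience (how many epochs to wait without improvement) are arbitrary settings.

```r
set.seed(7)
# Synthetic regression data (an assumption): noisy sine, polynomial features.
x <- runif(80, -1, 1)
y <- sin(3 * x) + rnorm(80, sd = 0.3)
X <- outer(x, 0:9, "^")            # columns x^0, x^1, ..., x^9
train <- 1:50
valid <- 51:80

# Mean squared error of weights w on the given rows.
mse <- function(w, rows) mean((X[rows, ] %*% w - y[rows])^2)

w <- rep(0, ncol(X))
learning_rate <- 0.1               # assumed settings
max_epochs <- 5000
patience <- 20
best_cost <- Inf
best_w <- w
best_epoch <- 0
waited <- 0

for (epoch in 1:max_epochs) {
  # One gradient descent step on the training cost only.
  residual <- X[train, ] %*% w - y[train]
  gradient <- 2 * t(X[train, ]) %*% residual / length(train)
  w <- w - learning_rate * as.vector(gradient)

  # Check the validation cost and remember the best weights seen so far.
  valid_cost <- mse(w, valid)
  if (valid_cost < best_cost) {
    best_cost <- valid_cost
    best_w <- w
    best_epoch <- epoch
    waited <- 0
  } else {
    waited <- waited + 1
    if (waited >= patience) break  # no improvement for `patience` epochs
  }
}
cat("Ran", epoch, "epochs; best validation MSE", best_cost,
    "reached at epoch", best_epoch, "\n")
```

The weights kept at the end are `best_w`, not the last `w`: they come from the epoch where the validation cost was lowest.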

Opening the black box of neural networks

The best way to learn how to build a neural network is to build one. Python offers a wealth of possible implementations for neural networks and deep learning. Python has libraries such as Theano, which allows complex computations at an abstract level, and more practical packages, such as Lasagne, which allows you to build neural networks, though it still requires some abstractions. For this reason, you need wrappers, such as nolearn, which is compatible with scikit-learn, or Keras, which can also wrap the TensorFlow library released by Google that has the potential to replace Theano as a software library for neural computation.

R provides libraries that are less complicated and more accessible, such as nnet, AMORE, and neuralnet. These brief examples in R show how to train both a classification network (on the Iris data set) and a regression network (on the Boston data set). Starting from classification, the following code loads the data set and splits it into training and test sets:



library(neuralnet)

# One-hot encode the species: one 0/1 column per class
target <- model.matrix( ~ Species - 1, data=iris )
colnames(target) <- c("setosa", "versicolor", "virginica")

# Randomly pick 100 of the 150 rows for training
index <- sample(1:nrow(iris), 100)

train_predictors <- iris[index, 1:4]
test_predictors <- iris[-index, 1:4]

Because neural networks rely on gradient descent, you need to standardize or normalize the inputs. Normalizing is preferable here: it rescales every feature so that its minimum is zero and its maximum is one. Naturally, you learn the conversion parameters from the training set only, in order to avoid using any information from the out-of-sample test set.
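A toy example may help before the real data. The sketch below uses made-up values for a single feature and relies on the base R function scale(), which subtracts its center argument and divides by its scale argument, column by column.

```r
# Toy feature values (illustrative, not taken from the Iris data).
train_x <- matrix(c(2, 4, 6, 10), ncol = 1, dimnames = list(NULL, "feature"))
test_x  <- matrix(c(1, 8, 12),    ncol = 1, dimnames = list(NULL, "feature"))

# Learn the minimum and the range from the training set only.
min_vector   <- apply(train_x, 2, min)
range_vector <- apply(train_x, 2, max) - min_vector

# Apply the same conversion to both sets.
train_scaled <- scale(train_x, center = min_vector, scale = range_vector)
test_scaled  <- scale(test_x,  center = min_vector, scale = range_vector)

print(as.vector(train_scaled))   # 0, 0.25, 0.5, 1
print(as.vector(test_scaled))    # -0.125, 0.75, 1.25
```

The test values fall outside the 0 to 1 range whenever they lie outside the range seen in training. That is expected, and it is the price of keeping test information out of the conversion. The Iris predictors get the same treatment: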

min_vector <- apply(train_predictors, 2, min)
range_vector <- apply(train_predictors, 2, max) -
                apply(train_predictors, 2, min)

# Scale the predictors and append the one-hot targets as columns 5 to 7
train_scaled <- cbind(scale(train_predictors,
                min_vector, range_vector),
                target[index,])
test_scaled <- cbind(scale(test_predictors,
                min_vector, range_vector),
                target[-index,])




When the training set is ready, you can train the model to guess three binary variables, with each one representing a class. The output is a value for each class proportional to its probability of being the real class. You pick a prediction by taking the highest value. You can also visualize the network by using the internal plot and thus seeing the neural network architecture and the assigned weights.
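The encoding and decoding steps can be seen in isolation in the short sketch below. The score matrix is made up for illustration; it only mimics the shape of a network's output and does not come from a trained model.

```r
# One-hot encode the Species factor: one 0/1 column per class.
target <- model.matrix(~ Species - 1, data = iris)
colnames(target) <- levels(iris$Species)
print(target[c(1, 51, 101), ])     # one row from each species

# Hypothetical network outputs for two flowers (made-up scores).
scores <- matrix(c(0.90, 0.08, 0.02,
                   0.10, 0.30, 0.60),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, levels(iris$Species)))

# Pick the class with the highest score in each row.
predicted_index <- apply(scores, 1, which.max)
predicted_class <- colnames(scores)[predicted_index]
print(predicted_class)             # "setosa" "virginica"
```

Each row of the target matrix contains exactly one 1, and which.max turns a row of scores back into a single class index. The network itself is trained and evaluated as follows: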


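nn_iris <- neuralnet(setosa + versicolor + virginica ~
                     Sepal.Length + Sepal.Width
                     + Petal.Length + Petal.Width,
                     data=train_scaled, hidden=c(2),
                     linear.output=FALSE)  # assumed completion: the call is truncated in the source

plot(nn_iris)  # internal plot of the architecture and weights

predictions <- compute(nn_iris, test_scaled[,1:4])

# The predicted and true classes are the columns holding the highest value
y_predicted <- apply(predictions$net.result,1,which.max)
y_true <- apply(test_scaled[,5:7],1,which.max)

confusion_matrix <- table(y_true, y_predicted)
accuracy <- sum(diag(confusion_matrix)) /
            sum(confusion_matrix)

print (confusion_matrix)
print (paste("Accuracy:",accuracy))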

You can plot a trained neural network.
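To see how the accuracy figure follows from the confusion matrix, here is a worked sketch with made-up counts for 50 test flowers. The numbers are hypothetical and are not the output of the network above.

```r
classes <- c("setosa", "versicolor", "virginica")

# Hypothetical confusion matrix: rows are true classes, columns predictions.
confusion_matrix <- matrix(c(17,  0,  0,
                              0, 15,  2,
                              0,  1, 15),
                           nrow = 3, byrow = TRUE,
                           dimnames = list(y_true = classes,
                                           y_predicted = classes))

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)

# Share of each true class that was predicted correctly.
recall <- diag(confusion_matrix) / rowSums(confusion_matrix)

print(accuracy)   # 47 / 50 = 0.94
print(recall)
```

Here 47 of the 50 flowers lie on the diagonal, so the accuracy is 0.94. The per-class figures show that all the errors come from confusing versicolor and virginica.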

The following example demonstrates how to predict house values in Boston, using the Boston data set. The procedure is the same as in the previous classification, but here you have a single output unit. The code also plots the test set’s predicted results against the real values to verify the good fit of the model.

library(MASS)  # provides the Boston data set

no_examples <- nrow(Boston)
features <- colnames(Boston)

# Randomly pick 400 of the 506 rows for training
index <- sample(1:no_examples, 400)

train <- Boston[index,]
test <- Boston[-index,]

# Normalize all 14 columns using training-set statistics only
min_vector <- apply(train,2,min)
range_vector <- apply(train,2,max) - apply(train,2,min)
scaled_train <- scale(train,min_vector,range_vector)
scaled_test <- scale(test, min_vector,range_vector)

# Build the formula medv ~ crim + zn + ... from the first 13 column names
formula <- as.formula(paste("medv ~", paste(features[1:13],
                      collapse="+")))

nn_boston <- neuralnet(formula, data=scaled_train,
                       hidden=c(5,3), linear.output=TRUE)

predictions <- compute(nn_boston, scaled_test[,1:13])

# Convert the predictions back to the original scale of medv (column 14)
predicted_values <- (predictions$net.result *
                     range_vector[14]) + min_vector[14]

RMSE <- sqrt(mean((test[,14] - predicted_values)^2))
print (paste("RMSE:",RMSE))

plot(test[,14],predicted_values, cex=1.5)
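An RMSE is easier to judge against a reference point. As a hedged sketch, the code below computes the RMSE of a naive baseline that always predicts the mean training value of medv. The seed is an arbitrary assumption, so the split differs from any unseeded run of the code above; to compare against the network, compute the baseline on the same train and test objects.

```r
library(MASS)   # provides the Boston data set

set.seed(101)   # arbitrary seed, for a reproducible split
index <- sample(1:nrow(Boston), 400)
train <- Boston[index, ]
test  <- Boston[-index, ]

# Baseline: always predict the mean training value of medv.
baseline_prediction <- mean(train$medv)
baseline_RMSE <- sqrt(mean((test$medv - baseline_prediction)^2))
print(paste("Baseline RMSE:", baseline_RMSE))
```

A network whose RMSE is not clearly below this baseline has learned little beyond the average house value.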