
Unveiling the Dropout Layer: An Essential Tool for Enhancing Neural Networks

Understanding the Dropout Layer: Improving Neural Network Training and Reducing Overfitting with Dropout Regularization

Niklas Lang
Towards Data Science

The dropout layer is a layer used in the construction of neural networks to prevent overfitting. During training, individual nodes are randomly excluded with a given probability, as if they were not part of the network architecture at all.

However, before we can get to the details of this layer, we should first understand how a neural network works and why overfitting can occur.

The perceptron is a mathematical model inspired by the structure of the human brain. It consists of a single neuron that receives numerical inputs with different weights. The inputs are multiplied by their weights and summed up, and the result is passed through an activation function. In its simplest form, the perceptron produces binary outputs, such as “Yes” or “No,” based on the activation function. The sigmoid function is commonly used as an activation function, mapping the weighted sum to values between 0 and 1. If the weighted sum exceeds a certain threshold, the output transitions from 0 to 1.

Figure: Basic Structure of a Perceptron | Source: Author
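To make this concrete, here is a minimal sketch of such a perceptron in plain Python with NumPy. The input values, weights, bias, and the 0.5 decision threshold are made-up illustrative choices, not taken from the article.

```python
import numpy as np

def sigmoid(z):
    # Squashes the weighted sum to a value between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(inputs, weights, bias):
    # Multiply inputs by their weights and sum them up
    weighted_sum = np.dot(inputs, weights) + bias
    # Pass the result through the activation function
    activation = sigmoid(weighted_sum)
    # Binary output: 1 ("Yes") if the activation crosses 0.5, else 0 ("No")
    return int(activation > 0.5)

# Illustrative values only
x = np.array([0.7, 0.2, 0.5])
w = np.array([0.4, -0.3, 0.8])
print(perceptron(x, w, bias=-0.2))  # prints 1 or 0 depending on the weighted sum
```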

For a more detailed look into the concept of perceptrons, feel free to refer to this article:

Overfitting occurs when a predictive model becomes too specific to the training data, learning both the patterns and noise present in the data. This results in poor generalization and inaccurate predictions on new, unseen data. Deep neural networks are particularly susceptible to overfitting as they can learn the statistical noise of the training data. However, abandoning complex architectures is not desirable, as they enable learning complex relationships. The introduction of dropout layers helps address overfitting by providing a solution to balance model complexity and generalization.

Figure: Difference between Generalisation and Overfitting | Source: Author

For a more detailed article on overfitting, please refer to our article on the topic:

With dropout, certain nodes are set to zero during a training run, i.e. effectively removed from the network. They then have no influence on the prediction or on backpropagation. As a result, a slightly modified network architecture is built in each run, and the network learns to produce good predictions without relying on certain inputs.

When adding a dropout layer, a so-called dropout probability must also be specified. This determines what fraction of the nodes in the layer will be set to zero. If we have an input layer with ten input values, a dropout probability of 10% means that, on average, one randomly chosen input will be set to zero in each training pass. If it is a hidden layer instead, the same logic is applied to the hidden nodes: a dropout probability of 10% means that 10% of the nodes will not be used in each run.
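As a rough illustration of this mechanism (not the article's original code), the following NumPy sketch zeroes each value of a ten-element input with a 10% probability. The values, the rate, and the random seed are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def dropout_forward(activations, rate):
    # Each node is kept with probability (1 - rate) and set to zero with probability `rate`
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask

# Ten input values; with a 10% dropout rate roughly one of them is zeroed per pass
inputs = np.arange(1.0, 11.0)
print(dropout_forward(inputs, rate=0.1))
```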

The optimal probability also depends strongly on the layer type. As various papers have found, the input layer should keep almost all of its nodes, i.e. use a retention probability close to one and thus only a small dropout rate. For hidden layers, on the other hand, a dropout probability close to 50% leads to better results.

In deep neural networks, overfitting usually occurs because certain neurons from different layers influence each other. Simply put, some neurons correct the errors of previous nodes and thus come to depend on each other, or they simply pass on the good results of the previous layer without major changes. This results in comparatively poor generalization.

By using the dropout layer, on the other hand, neurons can no longer rely on the nodes in previous or subsequent layers, since they cannot assume that those nodes even exist in a particular training run. As a result, the neurons demonstrably learn more fundamental structures in the data that do not depend on the presence of individual other neurons. Such dependencies actually occur relatively frequently in regular neural networks, as they are an easy way to quickly reduce the loss function and thereby get closer to the model's objective.

Also, as mentioned earlier, dropout slightly changes the architecture of the network in every run. The fully trained model is therefore effectively a combination of many slightly different models. We are already familiar with this approach from ensemble learning, for example in Random Forests. It turns out that an ensemble of many relatively similar models usually gives better results than a single model. This phenomenon is known as the "Wisdom of the Crowds".

In practice, the dropout layer is often used after a fully-connected layer, since this layer has comparatively many parameters and the probability of so-called "co-adaptation", i.e. the dependence of neurons on each other, is very high. In theory, however, a dropout layer can be inserted after any layer, although this can also lead to worse results.

Practically, the dropout layer is simply inserted after the desired layer and then uses the neurons of the previous layer as inputs. Depending on the value of the probability, some of these neurons are then set to zero and passed on to the subsequent layer.

It is particularly useful to use dropout layers in larger neural networks, since architectures with many layers tend to overfit much more strongly than smaller networks. It is also important to increase the number of nodes accordingly when a dropout layer is added. As a rule of thumb, the number of nodes used without dropout is divided by the retention probability, i.e. one minus the dropout rate: a hidden layer that works well with 256 nodes, for example, would be enlarged to 256 / 0.5 = 512 nodes for a dropout rate of 50%.

As we have now established, the use of a dropout layer during training is an important factor in avoiding overfitting. However, the question remains whether this mechanism is also used once the model has been trained and is then used to make predictions on new data.

In fact, the dropout layers are no longer active when making predictions after training. This means that all neurons contribute to the final prediction. The model therefore has more active neurons available than it did during training, so the summed activations would be noticeably larger than what the subsequent layers learned to expect. Therefore, the weights are scaled down in line with the dropout rate, more precisely multiplied by the retention probability, so that the model still makes good predictions.
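The following NumPy sketch illustrates why this scaling keeps the signal consistent. The layer size, the weights, and the 50% rate are illustrative assumptions, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

activations = np.ones(1000)     # outputs of a hidden layer (illustrative)
weights = np.full(1000, 0.01)   # outgoing weights (illustrative)
rate = 0.5                      # dropout rate used during training
keep_prob = 1.0 - rate

# During training: roughly half of the neurons are zeroed in each pass
mask = rng.random(activations.shape) >= rate
train_signal = np.dot(activations * mask, weights)

# At inference: all neurons are active, so the weights are scaled by the
# retention probability to keep the expected signal at the same level
inference_signal = np.dot(activations, weights * keep_prob)

print(train_signal, inference_signal)  # both close to 5.0
```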

For Python, there are already many predefined implementations with which you can use dropout layers. The best-known is probably the one in Keras and TensorFlow. You can import it, like other layer types, via "tf.keras.layers":
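The original listing is not reproduced here, but a minimal sketch using the standard tf.keras API could look as follows. The 20% rate, the input shape, and the sample data are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

tf.random.set_seed(0)

# Dropout layer with a 20% dropout probability; the input size is illustrative
layer = tf.keras.layers.Dropout(rate=0.2, input_shape=(2,))

data = np.arange(10, dtype=np.float32).reshape(5, 2)

# With training=True the layer randomly zeroes roughly 20% of the values
# (and rescales the remaining ones); with training=False it passes them through
outputs = layer(data, training=True)
print(outputs)
```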

Then you pass the parameters: the dropout probability, which you should choose depending on the layer type and the network structure, and optionally the size of the input vector. The layer can then be used by passing actual values in the variable "data". There is also the parameter "training", which specifies whether dropout should actually be applied; it is used only in training and not in the prediction of new values, the so-called inference.

If the parameter is not explicitly set, the dropout layer will only be active for “model.fit()”, i.e. training, and not for “model.predict()”, i.e. predicting new values.
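As a minimal sketch of how this fits into a complete model (the architecture, the random toy data, and all hyperparameters are illustrative assumptions), a dropout layer placed after a fully-connected layer is automatically active during "model.fit()" and inactive during "model.predict()":

```python
import numpy as np
import tensorflow as tf

# Toy data, purely for illustration
X = np.random.rand(256, 20).astype(np.float32)
y = np.random.randint(0, 2, size=(256, 1))

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # dropout after the fully-connected layer
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dropout is active here (training mode) ...
model.fit(X, y, epochs=2, batch_size=32, verbose=0)

# ... but not here: predictions use all neurons
preds = model.predict(X[:5], verbose=0)
print(preds)
```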

A dropout layer is a layer in a neural network that sets neurons to zero with a defined probability, i.e. ignores them in a training run.

In this way, the danger of overfitting can be reduced in deep neural networks, since the neurons do not form so-called co-adaptations among themselves, but instead recognize deeper structures in the data.

The dropout layer can be used in the input layer as well as in the hidden layers. However, it has been shown that different dropout probabilities should be used depending on the layer type.

Once training is complete, the dropout layer is no longer used for predictions. However, in order for the model to continue to produce good results, the weights are scaled using the dropout rate.

