Regularization

Brayan Florez
4 min read · Aug 29, 2020
Image: Why is Regularization used in Machine Learning? – mc.ai

Before describing some regularization techniques, let’s check out what regularization means. To make it simple, regularization is a set of techniques that aim to solve the overfitting problem (high variance), which prevents the model from working properly on unseen data.

L1 and L2 are the most common regularization techniques. Both of them update the cost function by adding a regularization term based on either the L1 or the L2 norm of the weights.

Cost function = Loss (binary cross entropy) + Regularization term

With the addition of the regularization term, the weights’ values decrease, and a neural network with smaller weights leads to simpler, more manageable models. This reduces overfitting by lowering the variance while increasing the bias only slightly, which is exactly what we want.

The regularization terms for both L1 and L2 have a parameter called lambda (λ). It is a hyperparameter whose value ranges from 0 to ∞, and it is tuned to get better results.
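As a rough sketch of that formula (plain NumPy; the function name, the 1/m and 1/(2m) scaling, and the argument names are my own assumptions, not from the article), the regularized cost could be computed like this:

import numpy as np

def regularized_cost(y, y_hat, weights, lambtha, m, mode="l2"):
    """Binary cross-entropy cost plus an L1 or L2 regularization term.
    y, y_hat -- labels and predictions, arrays of shape (m,)
    weights  -- list of the network's weight matrices
    lambtha  -- the regularization parameter lambda
    m        -- number of training examples
    """
    eps = 1e-8  # avoids log(0)
    bce = -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    if mode == "l1":
        # L1 term: lambda/m times the sum of the absolute values of the weights
        reg = (lambtha / m) * sum(np.sum(np.abs(W)) for W in weights)
    else:
        # L2 term: lambda/(2m) times the sum of the squared weights
        reg = (lambtha / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return bce + reg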

L1 (Lasso) Regularization

L1 improves the prediction error by shrinking some coefficients all the way down to 0 to reduce overfitting. This means features with a really small contribution, or those irrelevant to our training set, are effectively dropped, acting as a form of feature selection.

L2 (Ridge Regression) Regularization

Just as L1, it improves the prediction error by shrinking coefficients. The big difference between them is that L2 does not shrink values all the way down to 0; it only pushes some values close to 0.
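For instance, in Keras both penalties can be attached to a layer through kernel_regularizer (a minimal sketch; the layer size of 64 and the lambda of 0.01 are arbitrary placeholders):

from tensorflow import keras

# lambda = 0.01 is only a placeholder and should be tuned like any hyperparameter
lasso_layer = keras.layers.Dense(
    64, activation="relu",
    kernel_regularizer=keras.regularizers.l1(0.01))  # L1: can push weights to exactly 0

ridge_layer = keras.layers.Dense(
    64, activation="relu",
    kernel_regularizer=keras.regularizers.l2(0.01))  # L2: shrinks weights toward 0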

In short, the key difference between L1 and L2 is that L1 produces sparse models (some weights become exactly 0, which acts as feature selection), while L2 keeps every weight but shrinks them toward 0.

Dropout

The term dropout refers to dropping out nodes in any or all hidden layers of the network as well as in the input layer; it is not applied to the output layer.

As shown in the image above, at every training iteration dropout randomly chooses some nodes and removes them along with all of their incoming and outgoing connections. It is usually preferred with large neural networks because of this randomness. Doing this for every training example effectively gives us a different thinned model each time. At test time dropout is turned off and the full network is used, which approximates averaging the predictions of all of those models.

Dropout can be used with all network types. Worth noting: a good value for the keep probability in a hidden layer ranges from 0.5 to 0.8, while for the input layer a larger value such as 0.8 is common.
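As a minimal sketch of how this works during training (inverted dropout in NumPy; the function name and shapes are assumptions), keep_prob is the probability of keeping a node:

import numpy as np

def dropout_forward(A, keep_prob=0.8):
    """Apply inverted dropout to the activations A of one hidden layer.
    A         -- activations, shape (nodes, m)
    keep_prob -- probability of keeping each node (e.g. 0.5 to 0.8)
    At test time no mask is applied: the full network is used as-is.
    """
    mask = np.random.rand(*A.shape) < keep_prob  # randomly pick the nodes to keep
    A = A * mask                                 # drop the others and their connections
    return A / keep_prob                         # rescale so expected activations stay the same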

Data augmentation

It is a strategy that allows us to increase the diversity of the data available for training models without actually collecting new data. Wait, what? Let me add an easier-to-understand example.

The same image can provide us with different data by changing its perspective: flipping, rotating, scaling, cropping, adding some noise, and translating it.

“It can be considered a mandatory trick in order to improve our predictions.”
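For example, with Keras’ ImageDataGenerator several of those transformations can be applied on the fly while training (a hedged sketch; the ranges below are arbitrary placeholders, not recommendations):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,       # rotating
    width_shift_range=0.1,   # translating horizontally
    height_shift_range=0.1,  # translating vertically
    zoom_range=0.15,         # scaling
    horizontal_flip=True)    # flipping

# Typical usage (x_train and y_train assumed to exist):
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=10)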

Early Stopping

It acts like a trigger that allows us to stop training the model as soon as the error on the validation set starts increasing while the training error keeps decreasing.

By doing that we can prevent our model from being overtrained.

Check out the following code in Python to determine if you should do early stopping. Here early stopping should occur when the validation cost of the network has not decreased relative to the optimal validation cost by more than the threshold over a specific patience count.
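A minimal sketch of that check (the function name, signature, and return values are assumptions made for illustration):

def early_stopping(cost, opt_cost, threshold, count, patience):
    """Decide whether training should stop early.
    cost      -- current validation cost
    opt_cost  -- lowest validation cost seen so far
    threshold -- minimum improvement that resets the count
    count     -- checks in a row without enough improvement
    patience  -- how many such checks to tolerate before stopping
    Returns (should_stop, updated_count).
    """
    if opt_cost - cost > threshold:
        count = 0        # the validation cost improved enough: reset
    else:
        count += 1       # not enough improvement: one more strike
    return count >= patience, count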

I hope this article helps you understand how we can avoid overfitting in our models.
