Key concepts of neural network optimization

Brayan Florez
5 min read · Aug 24, 2020


Before starting, I should say that I've kept this blog as short as possible so that each concept we cover is easy to understand. With that said, let's start!

Hyperparameters

A hyperparameter is a parameter whose value is set before the training process begins, unlike the model's weights, which are learned during training. Hyperparameters are tunable and can affect how well a model is trained.
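As a hypothetical illustration, these are the kinds of values that are typically fixed up front (the names and numbers below are just examples, not recommendations):

```python
# Hypothetical example: hyperparameters are chosen before training starts
hyperparameters = {
    "learning_rate": 0.001,   # step size used by the optimizer
    "batch_size": 32,         # examples per mini-batch (see mini-batch gradient descent below)
    "epochs": 20,             # full passes over the training data
    "hidden_units": 128,      # width of a hidden layer
    "momentum_beta": 0.9,     # used by momentum / RMSProp / Adam (see below)
}
```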


Feature scaling

It’s one of the most essential parts of the data preprocessing stage, since it improves the performance of many machine learning algorithms. We use feature scaling to ensure that gradient-descent-based algorithms, such as linear regression and neural networks, move towards the minimum as smoothly as possible by updating all the features at the same rate.

We scale the data before feeding it to the model, since having all features on a similar scale helps these algorithms converge much more quickly towards the minimum.

The two most common methods of feature scaling are standardization and normalization.

Standardization (Z-score normalization)

It’s a technique in which values are centered around the mean, so the rescaled feature ends up with a mean of 0 and a standard deviation of 1.

In this scaling method, values are not restricted to a particular range. Standardization is normally used when the data follows a Gaussian distribution.
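A minimal NumPy sketch of what standardization does to a single feature (the sample values are made up):

```python
import numpy as np

x = np.array([50.0, 60.0, 70.0, 80.0, 90.0])  # made-up feature values

# Standardization (z-score): subtract the mean, divide by the standard deviation
x_standardized = (x - x.mean()) / x.std()

print(x_standardized.mean())  # ~0.0
print(x_standardized.std())   # ~1.0
```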

Normalization

It’s a technique in which values are shifted and rescaled so they end up within the range [0, 1] or [-1, 1].

Unlike standardization, normalization is used when the data does not follow a Gaussian distribution.
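A minimal NumPy sketch of min-max normalization into [0, 1], using the same made-up values:

```python
import numpy as np

x = np.array([50.0, 60.0, 70.0, 80.0, 90.0])  # made-up feature values

# Min-max normalization: the smallest value maps to 0, the largest to 1
x_normalized = (x - x.min()) / (x.max() - x.min())

print(x_normalized)  # [0.   0.25 0.5  0.75 1.  ]
```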

Pros:

  • Gradient descent will converge much faster.
  • Better performance in neural networks.

Cons:

  • Won’t guarantee better results.

Batch normalization

It’s a technique that normalizes the inputs to a layer, making the network more stable during training.

Rather than normalizing the data only once at the beginning, it is applied throughout the network, using learnable beta and gamma parameters to shift and rescale the normalized values so they are not squeezed into a fixed distribution.

By applying batch normalization it’s possible to use larger learning rates, which speeds up the learning process without harmful side effects. It also makes training deep networks less sensitive to the choice of weight initialization.

It may not be appropriate to use when the data does not follow a Gaussian distribution.
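A simplified sketch of the training-time batch norm transform, assuming a 2-D batch of layer activations (inference-time running averages of the mean and variance are omitted here):

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-8):
    """Normalize a batch of layer outputs, then rescale with gamma and shift with beta."""
    mu = z.mean(axis=0)                     # per-feature mean over the batch
    var = z.var(axis=0)                     # per-feature variance over the batch
    z_norm = (z - mu) / np.sqrt(var + eps)  # zero mean, unit variance (eps avoids division by zero)
    return gamma * z_norm + beta            # learned scale and shift
```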

Pros:

  • Less sensitive to the weight initialization.
  • It can use larger learning rates to speed the learning process up.

Cons:

  • Slower predictions, since extra computation is performed at each layer.

Mini-batch gradient descent

It is a variation of the gradient descent algorithm in which the training data set is split into small batches; each batch is used to calculate the model error and update the model coefficients, which reduces the variance of the gradient estimate.

Batched updates are computationally more efficient than updating on every single example, while still allowing robust convergence.

The mini-batch algorithm requires an additional “batch_size” hyperparameter, and the error must be accumulated across each mini-batch before an update is made.
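A minimal sketch of the idea, where grad_fn stands in for a hypothetical function that returns the gradient of the loss on one mini-batch:

```python
import numpy as np

def mini_batch_gradient_descent(X, Y, w, grad_fn, alpha=0.01, batch_size=32, epochs=10):
    """Update w using gradients computed on small batches instead of the full data set."""
    m = X.shape[0]
    for _ in range(epochs):
        perm = np.random.permutation(m)           # shuffle the data every epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]  # indices of one mini-batch
            w = w - alpha * grad_fn(X[idx], Y[idx], w)
    return w
```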

Pros:

  • It is computationally efficient.
  • Each batch easily fits in memory.
  • Faster learning.
  • Stable convergence.

Cons:

  • It requires an additional batch-size hyperparameter, as mentioned above.

Gradient descent with momentum

It computes an exponentially weighted average of the gradients, and then uses that average to update the weights instead of the raw gradient. It works faster than the standard gradient descent algorithm; the momentum term is simply a moving average of past gradients.

The algorithm tends to reach a local optimum in fewer iterations.
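A minimal sketch of a single momentum update, where beta (typically around 0.9) controls the moving average:

```python
def momentum_update(w, grad, v, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum step."""
    v = beta * v + (1 - beta) * grad  # exponentially weighted average of the gradients
    w = w - alpha * v                 # update with the smoothed gradient
    return w, v
```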

Pros:

  • Faster learning.

Cons:

  • The accumulated velocity can overshoot and miss a local minimum.

RMSProp (Root Mean Square Propagation) optimization

It uses the same idea of an exponentially weighted average as gradient descent with momentum, but it averages the squared gradients and uses that average to scale the parameter update.
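A minimal sketch of a single RMSProp update, showing how the averaged squared gradient rescales the step for each parameter:

```python
import numpy as np

def rmsprop_update(w, grad, s, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSProp step."""
    s = beta * s + (1 - beta) * grad ** 2      # weighted average of squared gradients
    w = w - alpha * grad / (np.sqrt(s) + eps)  # divide to damp large, noisy gradients
    return w, s
```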

Pros:

  • It is able to converge to solutions with better quality.
  • It converges faster than SGD.

Adam optimization

It can be used instead of classical stochastic gradient descent to update the network weights iteratively based on the training data. It combines the ideas of momentum and RMSProp; a minimal sketch of the update rule follows the list of benefits below.

When introducing the algorithm, the authors list the attractive benefits of using Adam on non-convex optimization problems, as follows:

  • Straightforward to implement.
  • Computationally efficient.
  • Little memory requirements.
  • Invariant to diagonal rescale of the gradients.
  • Well suited for problems that are large in terms of data and/or parameters.
  • Appropriate for non-stationary objectives.
  • Appropriate for problems with very noisy and/or sparse gradients.
  • Hyper-parameters have intuitive interpretation and typically require little tuning.
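A minimal sketch of one Adam update, combining the momentum and RMSProp averages with bias correction (t is the step number, starting at 1):

```python
import numpy as np

def adam_update(w, grad, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum plus RMSProp with bias correction."""
    v = beta1 * v + (1 - beta1) * grad       # first moment (momentum term)
    s = beta2 * s + (1 - beta2) * grad ** 2  # second moment (RMSProp term)
    v_hat = v / (1 - beta1 ** t)             # bias correction for the early steps
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s
```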

Learning rate decay

It is a technique for training neural networks that consists of starting with a large learning rate and then decaying it several times during training.

A large initial learning rate accelerates training; decaying it afterwards helps the network converge to a local minimum without oscillating around it. It is also worth mentioning that an initially large learning rate keeps the network from memorizing noisy data, while decaying the learning rate improves the learning of complex patterns.

A common way to compute it is inverse time decay, where the learning rate for a given epoch is α = α₀ / (1 + decay_rate × epoch_number).
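A minimal sketch of that schedule:

```python
def decayed_learning_rate(alpha0, decay_rate, epoch):
    """Inverse time decay: the learning rate shrinks as the epoch number grows."""
    return alpha0 / (1 + decay_rate * epoch)

# With alpha0 = 0.1 and decay_rate = 1.0:
# epoch 0 -> 0.100, epoch 1 -> 0.050, epoch 2 -> 0.033, ...
```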

Thanks for reading it!

