Activation functions

Brayan Florez
4 min read · Aug 12, 2020


To understand this concept better, we first need to recall what a neuron does. Put simply, it calculates a weighted sum of its inputs, adds a bias, and then decides whether it should be activated or not.

The following is the formula for what we described above:

y = Σ (wᵢ · xᵢ) + b  (the weighted sum of the inputs plus the bias)

The value of y can be anything from −∞ to +∞, and this is where the activation function does its job: it helps us decide whether the neuron should be fired or not (fired = activated).
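As a concrete illustration, here is a minimal NumPy sketch of that computation; the weight, input, and bias values are made up for the example:

```python
import numpy as np

# Toy example: one neuron with 3 inputs (illustrative values).
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, -0.4])   # weights
b = 0.2                          # bias

# Weighted sum of the inputs plus the bias.
y = np.dot(w, x) + b
print(y)  # -> -0.72 (in general, y can be any value from -inf to +inf)
```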

But when does it get activated? Well, to put it as a simple formula, we can say this:

Activation function A = 1 if y > threshold, 0 otherwise

The neuron gets activated (outputs 1) when y is greater than the threshold, and outputs 0 when it's not. (This is essentially the binary step function, which we'll discuss below.)

Now, let's get back to what we came here for: activation functions. Many activation functions are used in machine learning practice, but here we'll look at the most common ones.

Binary step function

Do you remember the activation formula above? That is exactly what the binary step function is about: if the input value is greater than a certain threshold, the neuron is fired and sends a signal to the next layer; otherwise it isn't.

It's simple and useful for binary classification problems. The trouble comes when the input has to be classified into one of several categories, since this function cannot produce multiple-value outputs.
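Here is a small sketch of a binary step activation in NumPy; the threshold and input values are just illustrative:

```python
import numpy as np

def binary_step(y, threshold=0.0):
    """Fire (1) if the pre-activation y exceeds the threshold, else 0."""
    return np.where(y > threshold, 1, 0)

print(binary_step(np.array([-0.72, 0.3, 2.5])))  # -> [0 1 1]
```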

Linear function

It is a straight-line activation function: it takes the inputs, multiplied by the weights of each neuron, and produces an output signal proportional to the input. This is better than the binary step function in that it allows multiple-value outputs.

The main problem with this activation is that, no matter how many layers you stack, you end up with a linear function, since a linear combination of linear functions is still a linear function; the whole neural network collapses into the equivalent of a single layer. Also, because its derivative is a constant, backpropagation produces a gradient that carries no information about the input, so the network cannot really learn from it.
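To see why stacking linear layers collapses, here is a quick NumPy sketch, with random illustrative weights, showing that two layers with a linear (identity) activation compute exactly the same function as a single linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with a linear (identity) activation in between (illustrative sizes).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Forward pass through both layers: W2 @ (W1 @ x + b1) + b2
two_layer = W2 @ (W1 @ x + b1) + b2

# The same mapping collapsed into a single linear layer.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # -> True
```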

Sigmoid function

The sigmoid function takes the following form:

σ(x) = 1 / (1 + e^(−x))

This function lets us normalize the output of each neuron to a value between 0 and 1. It also gives clear predictions: when x is above 2 or below −2, it pushes the activation value toward the edges of the curve, close to 1 or 0 respectively.

On the other hand, this function has some drawbacks. The first and most important is that it's computationally expensive. The second is the vanishing gradient problem: when x is a very high or very low value, the gradient becomes tiny, which slows the network down and can stop it from learning further. Last but not least, the outputs are not zero-centered.
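A short NumPy sketch of the sigmoid and its derivative (the input values are illustrative), showing how the gradient vanishes at the extremes:

```python
import numpy as np

def sigmoid(x):
    """Squash any real value into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))       # values pushed toward 0 or 1 at the extremes
print(sigmoid_grad(x))  # gradients close to 0 at the extremes -> vanishing gradient
```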

Tanh function (Hyperbolic tangent)

The tanh function is zero-centered, which makes it easier to model inputs that have strongly negative, neutral, and strongly positive values.

This function, just like the sigmoid, has the two following disadvantages:

  • It is computationally expensive
  • The vanishing gradient problem
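A quick NumPy sketch illustrating both points for tanh (the input values are just for illustration):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

# tanh squashes values into (-1, 1) and is centered at 0,
# so negative, neutral and positive inputs keep their sign.
print(np.tanh(x))

# Like the sigmoid, its gradient (1 - tanh(x)^2) shrinks toward 0
# for large |x|, which causes the vanishing gradient problem.
print(1.0 - np.tanh(x) ** 2)
```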

ReLU (Rectified Linear Unit)

In contrast to sigmoid and tanh, ReLU is computationally cheaper and more efficient, which allows the network to converge very quickly.

Even though ReLU looks like a linear function, it is not, and having a derivative allows it to be used with backpropagation.

Its downside is the dying ReLU problem: when inputs are negative (or exactly zero), the gradient of the function becomes 0, so those neurons stop receiving updates during backpropagation and can no longer learn.
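A minimal NumPy sketch of ReLU and its gradient, showing where the gradient dies (the input values are illustrative):

```python
import numpy as np

def relu(x):
    """max(0, x): cheap to compute, and non-linear despite looking linear."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Gradient is 1 for positive inputs and 0 otherwise (the dying-ReLU case)."""
    return np.where(x > 0, 1.0, 0.0)

x = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])
print(relu(x))       # -> [0.  0.  0.  0.1 3. ]
print(relu_grad(x))  # -> [0. 0. 0. 1. 1.]  (no gradient flows for x <= 0)
```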

Softmax function

Softmax lets us handle multiple classes: it exponentiates the output for each class and divides by the sum over all classes, normalizing each output to a value between 0 and 1 that can be read as the probability that the input belongs to that class.

It is normally used in the output layer of a network to classify inputs into multiple categories.
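A small NumPy sketch of softmax; the class scores are illustrative, and subtracting the maximum is a common numerical-stability trick rather than part of the definition:

```python
import numpy as np

def softmax(z):
    """Turn raw class scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # illustrative output-layer scores for 3 classes
probs = softmax(scores)
print(probs)        # -> [0.659 0.242 0.099] (approximately)
print(probs.sum())  # -> 1.0
```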

The following is an image I liked a lot, in which you can find several activation functions along with their equations and derivatives.

Hope you enjoyed the article and learned a lot about this amazing topic. Thanks!
