How to choose an activation function for your network


This is the third post in the optimization series, where we are trying to give the reader a comprehensive review of optimization in deep learning. So far, we have looked at how:

This is a companion discussion topic for the original entry at


Thanks for the awesome blog post.

In the blog post, you mentioned that

The inputs which contain the concept to varying degrees produce a variance in the positive output of the neuron.

I am unable to understand why negative activations are considered noise. Could you elaborate?


I didn't fully get it either. But what I could make of it is this: if a node in a particular layer learns a particular aspect of a feature, wouldn't it be better for it to simply reject all other aspects? They might not literally be noise, but for that particular node, encoding those non-target aspects as negative values is not a useful choice. A rough sketch of this idea is below.
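To make that concrete, here is a minimal sketch (NumPy only; the numbers are made up for illustration) of how ReLU keeps the graded positive evidence for a node's target aspect and clips everything else to zero:

```python
import numpy as np

def relu(x):
    # ReLU zeroes out negative pre-activations and passes positives through.
    return np.maximum(0.0, x)

# Hypothetical pre-activations of one node for five inputs: positive values
# mean the input contains the node's target concept to some degree; negative
# values come from inputs that lack it and get clipped to zero.
pre_activations = np.array([2.3, 0.7, -1.5, 4.1, -0.2])

print(relu(pre_activations))  # [2.3 0.7 0.  4.1 0. ]
```

On this reading, the variance that matters for the node lives entirely in the positive outputs, which is why the post treats the negative side as discardable.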


Chinese translation:

Possible typos:

"Let us assume that we have a neurons values of which are unbounded" — probably meant "a neuron, values of which are unbounded" (or "a neuron whose values are unbounded").

"This can cause a zig zag patter in search of minima" — "patter" should be "pattern".

"people have taken the approach further by not randomly sampling the negative slope α but turning it into a hyperparameter, which is learned by the network during training." — since α is learned by the network during training, it is a parameter, not a hyperparameter (this is the PReLU idea).

"\alpha is a hyperparameter that is normally chosen to be 1" — the raw "\alpha" should render as α.

"my next choice is a either a Leaky ReLU or a ELU" — should be "either a Leaky ReLU or an ELU".
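Those last three quotes concern Leaky ReLU, PReLU, and ELU. For reference, a minimal sketch of all three, including PReLU's learnable slope (I'm assuming PyTorch here; the framework choice and the input values are mine, not the post's):

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])  # made-up pre-activations

leaky = nn.LeakyReLU(negative_slope=0.01)  # fixed small slope for x < 0
prelu = nn.PReLU(init=0.25)                # slope is a learnable parameter
elu = nn.ELU(alpha=1.0)                    # alpha * (exp(x) - 1) for x < 0

print(leaky(x))  # [-0.0200, -0.0050, 0.0000, 1.5000]
print(prelu(x))  # [-0.5000, -0.1250, 0.0000, 1.5000], before any training
print(elu(x))    # [-0.8647, -0.3935, 0.0000, 1.5000]

# Unlike LeakyReLU's fixed slope, PReLU's slope is registered among the
# network's trainable parameters and gets updated by backprop:
print(list(prelu.parameters()))
```

This is also why calling PReLU's α a "hyperparameter" reads as a typo: it appears in `prelu.parameters()` and is optimized like any weight, whereas ELU's α really is a hyperparameter you pick up front.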