How to choose an activation function for your network


#1

This is the third post in the optimization series, where we are trying to give the reader a comprehensive review of optimization in deep learning. So far, we have looked at how:


This is a companion discussion topic for the original entry at https://blog.paperspace.com/vanishing-gradients-activation-function/

#2

Thanks for the awesome blog post.

In the blog post, you mentioned that

The inputs which contain the concept to varying degrees produce a variance in the positive output of the neuron.

I am unable to understand why negative activations are considered noise. Could you elaborate?


#3

I didn't fully get it either. But my reading is this: if a node in a particular layer learns a particular aspect of a feature, wouldn't it be better for that node to simply reject all the other aspects? They might not strictly be noise, but for that particular node, encoding those non-target aspects as negative values doesn't seem like a good choice. A small sketch of this behaviour follows below.
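To make that concrete, here is a minimal sketch in PyTorch (my own illustration, not code from the post): ReLU throws away negative pre-activations entirely, while Leaky ReLU keeps a small scaled trace of them.

```python
import torch
import torch.nn as nn

# Pre-activations for one neuron over a small batch: positive values
# indicate the feature the neuron detects, negative values do not.
pre_activations = torch.tensor([2.0, 0.5, -1.0, -3.0])

relu = nn.ReLU()            # zeroes out every negative pre-activation
leaky = nn.LeakyReLU(0.01)  # keeps negatives, scaled by a small slope

print(relu(pre_activations))   # roughly: 2.0, 0.5, 0.0, 0.0
print(leaky(pre_activations))  # roughly: 2.0, 0.5, -0.01, -0.03
```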


#4

Chinese translation: www.jqr.com/article/000526

Possible typos:

- "Let us assume that we have a neurons values of which are unbounded" → "neurons" should be "neuron's"
- "This can cause a zig zag patter in search of minima" → "patter" should be "pattern"
- "people have taken the approach further by not randomly sampling the negative slope α but turning it into a hyperparameter, which is learned by the network during training." → since the slope is learned during training, it is a parameter rather than a hyperparameter (see the sketch after this list)
- "\alpha is a hyperparameter that is normally chosen to be 1" → the raw "\alpha" should render as "α"
- "my next choice is a either a Leaky ReLU or a ELU" → the stray "a" before "either" should be dropped (and "a ELU" should be "an ELU")
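On the parameter vs. hyperparameter point, here is a minimal PyTorch sketch (my own illustration, not code from the post): in PReLU the negative slope is a learnable parameter updated by backprop, while in ELU the α is a fixed hyperparameter chosen before training, commonly 1.

```python
import torch
import torch.nn as nn

# PReLU: the negative slope starts at `init` and is a trainable
# parameter, updated by backprop like any other weight.
prelu = nn.PReLU(init=0.25)
print(list(prelu.parameters()))  # one trainable tensor holding the slope

# ELU: alpha is a fixed hyperparameter set before training
# (commonly 1.0) and never updated by the optimizer.
elu = nn.ELU(alpha=1.0)
print(list(elu.parameters()))    # empty list: nothing to learn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(prelu(x))  # negatives scaled by the current learned slope
print(elu(x))    # negatives mapped to alpha * (exp(x) - 1)
```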