Intro to optimization in deep learning: Momentum, RMSProp and Adam


#1

In another post, we covered the nuts and bolts of Stochastic Gradient Descent and how to address problems like getting stuck in a local minimum or at a saddle point. In this post, we take a look at another problem that plagues the training of neural networks: pathological curvature.


This is a companion discussion topic for the original entry at https://blog.paperspace.com/intro-to-optimization-momentum-rmsprop-adam/

#2

Chinese translation: https://www.jqr.com/article/000505

Possible typos:

"which means the surface , which means a surface is rapidly getting less steeper as we move."
The duplicated "which means the surface ," should be deleted.

"The epsilon is equation 2, is to ensure that we do not end up dividing by zero"
"is equation 2" should read "in equation 2".

"we multiply our learning rate by average of the gradient (as was the case with momentum) and divide it by the root mean square of the exponential average of square of gradients (as was the case with momentum)"
The second "momentum" should be "RMSProp" (see the sketch below).
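For context on that last one, a common way to write the RMSProp update (the notation here is assumed, not quoted from the article: \beta is the decay rate, \eta the learning rate, \epsilon a small constant, g_t the gradient) is:

s_t = \beta s_{t-1} + (1 - \beta) g_t^2
w_{t+1} = w_t - \eta \cdot g_t / (\sqrt{s_t} + \epsilon)

Dividing by the root mean square of the exponential average of squared gradients is the RMSProp part, which is why "RMSProp" rather than "momentum" fits in that second parenthesis.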


#3

In the equations for Adam, v_t and s_t have negative signs in between, but they should have been positive. RMSProp's equation has a positive sign. Is it a glitch?
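For reference, the standard exponential-average updates in Adam (without bias correction, and assuming \beta_1 and \beta_2 as the decay rates and g_t as the gradient) join the two terms with a plus sign, just like RMSProp:

v_t = \beta_1 v_{t-1} + (1 - \beta_1) g_t
s_t = \beta_2 s_{t-1} + (1 - \beta_2) g_t^2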


#4

It looks like your \Delta w_t for Adam has an extra * g_t
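For comparison, the usual Adam step (again without bias correction, and with \eta as the learning rate and \epsilon a small constant, notation assumed here) has no trailing gradient factor:

\Delta w_t = -\eta \cdot v_t / (\sqrt{s_t} + \epsilon)
w_{t+1} = w_t + \Delta w_t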


#5

The article is brilliant. Thanks :slightly_smiling_face:
Just one doubt in the derivation of gradient descent with momentum: if I am not wrong, there is a component of the gradient in the downward (z) direction as well, right? That means that the weights will also be taking some steps downwards in addition to the zig-zag motion in the x-y plane…
Also, I think there is a correction needed in the Adam formulae: you have an extra g_t term in the update rule.
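Here is a minimal NumPy sketch of the momentum update being discussed, on a toy two-parameter quadratic loss; the loss coefficients, learning rate, and momentum value below are illustrative assumptions, not taken from the article:

```python
import numpy as np

# Toy elongated quadratic loss: L(w) = 0.5 * (a * w1**2 + b * w2**2).
# The curvature along w2 is much larger than along w1, which is the kind
# of surface where plain gradient descent zig-zags.
a, b = 1.0, 50.0

def grad(w):
    # Gradient of L with respect to the parameters (lives in the w1-w2 plane)
    return np.array([a * w[0], b * w[1]])

w = np.array([10.0, 1.0])    # initial parameters
v = np.zeros_like(w)         # exponential average of past gradients
lr, beta = 0.02, 0.9         # learning rate and momentum coefficient (assumed values)

for t in range(200):
    g = grad(w)
    v = beta * v + (1 - beta) * g   # momentum: accumulate gradients
    w = w - lr * v                  # step in the parameter plane

print(w)   # approaches the minimum at (0, 0)
```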