Intro to optimization in deep learning: Momentum, RMSProp and Adam


In another post, we covered the nuts and bolts of Stochastic Gradient Descent and how to address problems like getting stuck in a local minimum or at a saddle point. In this post, we take a look at another problem that plagues the training of neural networks: pathological curvature.

This is a companion discussion topic for the original entry at


Chinese translation:

Possible typos:

“which means the surface , which means a surface is rapidly getting less steeper as we move.”

The duplicated “which means the surface ,” should be deleted, and “less steeper” should read “less steep”.

The epsilon in equation 2 is to ensure that we do not end up dividing by zero.


“we multiply our learning rate by average of the gradient (as was the case with momentum) and divide it by the root mean square of the exponential average of square of gradients (as was the case with momentum)”

The second parenthetical should presumably read “(as was the case with RMSProp)”.
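The sentence being quoted describes how Adam combines the two earlier methods. A minimal single-parameter sketch of that combined step (names like `lr`, `beta1`, `beta2`, `eps` are the conventional ones, not necessarily the article's; bias correction is omitted here):

```python
import math

def adam_style_step(v, s, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum part: exponential moving average of the gradient.
    v = beta1 * v + (1 - beta1) * grad
    # RMSProp part: exponential moving average of the squared gradient.
    s = beta2 * s + (1 - beta2) * grad ** 2
    # Step = learning rate * averaged gradient / RMS of averaged squared gradients.
    step = lr * v / (math.sqrt(s) + eps)
    return v, s, step
```

The `eps` in the denominator is exactly the epsilon the comment above mentions: it guards against division by zero when `s` is still tiny.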



In the equation for Adam, v_t and s_t have negative signs in between, but they should have been positive. RMSProp’s equation has a positive sign. Is it a glitch?
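For reference, the moment estimates in the standard formulation of Adam do use plus signs, consistent with RMSProp (a sketch of the usual form, not a quote from the article):

```latex
v_t = \beta_1 v_{t-1} + (1 - \beta_1)\, g_t, \qquad
s_t = \beta_2 s_{t-1} + (1 - \beta_2)\, g_t^2
```

so a minus sign between the two terms in the article would indeed be a typo rather than a different algorithm.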


It looks like your \Delta w_t for Adam has an extra * g_t.


The article is brilliant. Thanks :slightly_smiling_face:
Just one doubt in the derivation of gradient descent with momentum: if I am not wrong, there is a component of the gradient in the downward (z) direction as well, right? That means the weights will also be taking some steps downwards, in addition to the zig-zag motion in the x-y plane…
Also, I think there is a correction needed in the Adam formula: you have an extra g_t term in the update rule.
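On the extra g_t point: in the standard Adam update, the bias-corrected average v̂_t already stands in for the gradient, so no additional g_t factor appears. A minimal single-parameter sketch of one full Adam step, with assumed conventional names (`lr`, `beta1`, `beta2`, `eps`, step count `t`):

```python
import math

def adam_step(w, v, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and squared gradient.
    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized averages.
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    # Update uses v_hat directly: no extra grad factor here.
    w = w - lr * v_hat / (math.sqrt(s_hat) + eps)
    return w, v, s
```

At t = 1 with grad = 1, both corrected moments equal 1, so the very first step is approximately the learning rate itself.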


Great article, thank you! It helped me a lot to understand; it was much clearer than most tutorials/explanations found on the net, detailed but easily understandable. Good pedagogical work.

However, I’m not sure I understand: in the first two equations of Adam (the v_t and s_t definitions), why do we subtract the (1 - beta) * g_t term, whereas in RMSProp we add this term rather than subtracting it?
Is this a typo? Or is there a reason the sign changes in Adam?