Intro to optimization in deep learning: Momentum, RMSProp and Adam


In another post, we covered the nuts and bolts of Stochastic Gradient Descent and how to address problems like getting stuck in a local minimum or a saddle point. In this post, we take a look at another problem that plagues the training of neural networks: pathological curvature.

This is a companion discussion topic for the original entry at


Chinese translation:

Possible typos:

which means the surface , which means a surface is rapidly getting less steeper as we move.

The repeated fragment "which means the surface ," should be deleted.

The epsilon is equation 2, is to ensure that we do not end up dividing by zero


we multiply our learning rate by average of the gradient (as was the case with momentum) and divide it by the root mean square of the exponential average of square of gradients (as was the case with momentum)
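To make the two quoted sentences concrete, here is a minimal sketch of a single Adam-style update for one scalar weight. It is not the post's code: the function name `adam_step`, the default hyperparameters, and the choice to add `eps` outside the square root are assumptions for illustration, and bias correction is left out to keep it short. The first average is the momentum part, the second is the RMSProp part.

```python
import math

def adam_step(w, g, v, s, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update for a single scalar weight (hypothetical helper).

    Assumptions: no bias correction, eps added outside the square root.
    """
    # Exponential average of the gradient (the momentum part).
    v = beta1 * v + (1 - beta1) * g
    # Exponential average of the squared gradient (the RMSProp part).
    s = beta2 * s + (1 - beta2) * g * g
    # eps keeps the denominator away from zero when s is 0.
    w = w - lr * v / (math.sqrt(s) + eps)
    return w, v, s

# With a zero gradient and zero state, s stays 0, so sqrt(s) is 0;
# only eps prevents a ZeroDivisionError here.
w, v, s = adam_step(w=1.0, g=0.0, v=0.0, s=0.0)
```

Without `eps`, the call at the bottom would divide by zero, which is the situation the epsilon sentence is describing.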



In the equations for Adam, the two terms in v_t and in s_t are joined by a minus sign, but it should be a plus sign. The RMSProp equation has a plus sign. Is it a glitch?
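For comparison, this is how the two exponential averages are usually written, with a plus sign between the terms. The symbols v_t, s_t and g_t follow the post as quoted in this thread; β1, β2, the learning rate η, the weight w and the placement of ε are assumed notation, and bias correction is omitted.

$$
\begin{aligned}
v_t &= \beta_1 \, v_{t-1} + (1 - \beta_1) \, g_t \\
s_t &= \beta_2 \, s_{t-1} + (1 - \beta_2) \, g_t^2 \\
\Delta w_t &= -\eta \, \frac{v_t}{\sqrt{s_t} + \epsilon}
\end{aligned}
$$

A quick sanity check supports the plus sign: starting from s_0 = 0, a minus sign would give s_1 = -(1 - β2) g_1^2, which is negative for any nonzero gradient, so the square root in the update would not be defined.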