Adam, AdamW Optimizers


The most popular optimizer in neural networks is the Adam Optimizer.

https://arxiv.org/abs/1412.6980 [Adam: A Method for Stochastic Optimization (2014)]

It combines momentum with per-parameter learning rate scaling, like RMSProp.

https://arxiv.org/pdf/1711.05101.pdf [Decoupled Weight Decay Regularization (2019)]

AdamW extends Adam by decoupling weight decay from the gradient-based update, rather than folding it into the loss as L2 regularization. The combined AdamW update is:

m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t

v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2

\theta_t = \theta_{t-1} - \eta_t \left( \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right)

Adam also performs bias correction to compensate for the zero-initialized moments during early updates.

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
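For example, with the typical β_1 = 0.9 listed below, the first moment after one step is m_1 = 0.1 g_1, an underestimate by a factor of ten; dividing by 1 - \beta_1^1 = 0.1 recovers \hat{m}_1 = g_1.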

g_t is the gradient, m_t is the momentum (first moment) term, and v_t is the RMSProp (second moment) term.

α is the learning rate and η_t is the learning rate schedule multiplier, which can simply be set to 1.

λ is the weight decay (regularization) parameter.

Typical values for parameters include

\beta_1 = 0.9, \quad \beta_2 = 0.999, \quad \epsilon = 10^{-8}, \quad \alpha = 10^{-3}, \quad \lambda = 10^{-2}
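As a concrete reference, here is a minimal NumPy sketch of one plain (unfused) AdamW step under the notation above; the function and variable names are my own, not from either paper.

import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2, schedule=1.0):
    """One plain AdamW step, following the update equations above."""
    m = beta1 * m + (1 - beta1) * g        # momentum (first moment)
    v = beta2 * v + (1 - beta2) * g * g    # RMSProp term (second moment)
    m_hat = m / (1 - beta1 ** t)           # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied to the weights directly instead of
    # being added to the gradient as L2 regularization would be.
    w = w - schedule * (lr * m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

With m = v = 0 and t = 1, the bias correction makes \hat{m}_1 = g_1 and \sqrt{\hat{v}_1} = |g_1|, so the very first adaptive step is roughly lr in magnitude.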

Notice how we can simplify the update equation by bringing the bias-correction divisors out into a single scalar.

\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} = \frac{m_t / (1 - \beta_1^t)}{\sqrt{v_t} / \sqrt{1 - \beta_2^t} + \epsilon} = \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \cdot \frac{m_t}{\sqrt{v_t} + \epsilon \sqrt{1 - \beta_2^t}}

So the adaptive part of the step becomes

\alpha_t \, \frac{m_t}{\sqrt{v_t} + \hat{\epsilon}_t}, \qquad \text{where} \quad \alpha_t = \alpha \, \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t}, \quad \hat{\epsilon}_t = \epsilon \sqrt{1 - \beta_2^t}
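A quick numerical sanity check of this rearrangement (a throwaway NumPy sketch; the state values here are random placeholders, not anything from the text):

import numpy as np

rng = np.random.default_rng(0)
beta1, beta2, eps, t = 0.9, 0.999, 1e-8, 5
m, v = rng.normal(size=1000), rng.random(1000)   # fake optimizer state

# Original form: bias-correct first, then divide.
m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
original = m_hat / (np.sqrt(v_hat) + eps)

# Rearranged form: fold the corrections into a scalar and into epsilon.
scale = np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
rearranged = scale * m / (np.sqrt(v) + eps * np.sqrt(1 - beta2 ** t))

print(np.max(np.abs(original - rearranged)))  # ~1e-16: identical up to rounding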

Likewise, to take advantage of the fast reciprocal square root (rsqrt), we need to fuse the epsilon term into the square root.

Notice how we first prove a triangle-inequality style bound (the square root is subadditive).

For a, b \ge 0, squaring shows (\sqrt{a} + \sqrt{b})^2 = a + 2\sqrt{ab} + b \ge a + b, hence

\sqrt{a + b} \le \sqrt{a} + \sqrt{b}

Taking a = v_t and b = \hat{\epsilon}_t^{\,2} gives \sqrt{v_t + \hat{\epsilon}_t^{\,2}} \le \sqrt{v_t} + \hat{\epsilon}_t, so

\frac{1}{\sqrt{v_t} + \hat{\epsilon}_t} \le \frac{1}{\sqrt{v_t + \hat{\epsilon}_t^{\,2}}} = \mathrm{rsqrt}\!\left(v_t + \hat{\epsilon}_t^{\,2}\right)

and the two sides differ only through the dropped cross term 2\hat{\epsilon}_t\sqrt{v_t}.
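Here is a small NumPy sketch of how the exact and fused denominators compare (my own illustration; the values of v are arbitrary):

import numpy as np

eps = 1e-7  # kept near float32 machine resolution, as noted below

for v in [1e-2, 1e-4, 1e-6, 1e-8, 0.0]:
    exact = 1.0 / (np.sqrt(v) + eps)      # sqrt, add, then divide
    fused = 1.0 / np.sqrt(v + eps * eps)  # one reciprocal square root
    print(f"v={v:.0e}  exact={exact:.6e}  fused={fused:.6e}")

# The relative gap is about eps / sqrt(v), so the two agree closely for all
# but vanishingly small v, and both stay bounded by 1/eps as v -> 0, which
# is all epsilon is there for.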

So, by computing these scalars once per step and storing them in memory, we have that

\alpha_t = \alpha \, \eta_t \, \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t}, \qquad \hat{\epsilon}_t^{\,2} = \epsilon^2 \left(1 - \beta_2^t\right)

\theta_t = \theta_{t-1} - \alpha_t \, m_t \cdot \mathrm{rsqrt}\!\left(v_t + \hat{\epsilon}_t^{\,2}\right) - \eta_t \, \lambda \, \theta_{t-1}

Also, notice the exponent t in the bias correction starts from 1 and not 0!

Likewise, notice that \hat{\epsilon}_t^{\,2} = \epsilon^2 (1 - \beta_2^t) is essentially just \epsilon^2, since a small number squared is even smaller and scaling it barely changes anything.

Keep ε at machine resolution, i.e. around 10^{-7} for float32.
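For reference, float32 machine epsilon can be checked with NumPy:

import numpy as np

print(np.finfo(np.float32).eps)  # 1.1920929e-07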

So, by computing g_t with mini-batch gradient descent, we have the final algorithm:

Given m_0 = v_0 = 0, for each step t = 1, 2, \ldots:

g_t = \nabla_{\theta} \mathcal{L}_t(\theta_{t-1}) \quad \text{(gradient on the current mini-batch)}

m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t

v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2

\alpha_t = \alpha \, \eta_t \, \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t}

\theta_t = \theta_{t-1} - \alpha_t \, m_t \cdot \mathrm{rsqrt}\!\left(v_t + \epsilon^2\right) - \eta_t \, \lambda \, \theta_{t-1}
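Putting the pieces together, here is a minimal NumPy sketch of the fused step above (a sketch under this page's assumptions, with my own function and variable names, not a drop-in library optimizer):

import numpy as np

def fused_adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                     eps=1e-7, weight_decay=1e-2, schedule=1.0):
    """One fused AdamW step with bias correction folded into scalars."""
    m = beta1 * m + (1 - beta1) * g        # momentum (first moment)
    v = beta2 * v + (1 - beta2) * g * g    # RMSProp term (second moment)
    # Per-step scalars: computed once, shared by every parameter.
    c1 = 1 - beta1 ** t                    # bias correction for m (t starts at 1)
    c2 = 1 - beta2 ** t                    # bias correction for v
    step_size = schedule * lr * np.sqrt(c2) / c1
    # Fused update: epsilon is folded inside the square root, so the whole
    # denominator is a single reciprocal square root.
    w = w - step_size * m / np.sqrt(v + eps * eps) - schedule * weight_decay * w
    return w, m, v

# Tiny usage example on a toy quadratic loss 0.5 * w**2 (gradient = w).
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 101):
    g = w                                  # gradient of 0.5 * w**2
    w, m, v = fused_adamw_step(w, g, m, v, t)
print(w)                                   # weight magnitudes shrink toward zero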

(c) Copyright Protected: Daniel Han-Chen 2020

License: All content on this page is for educational and personal purposes only.

Usage of material, concepts, equations, methods, and all intellectual property on any page in this publication is forbidden for any commercial purpose, be it promotional or revenue-generating. I also accept no liability for any damages caused by my material. Knowledge and methods summarized from various sources such as papers, YouTube videos, and other mediums are protected under the original publishers' licensing arrangements.
