The most popular optimizer for training neural networks is
the Adam optimizer.
https://arxiv.org/abs/1412.6980 [Adam:
A Method for Stochastic Optimization (2014)]
It combines momentum with RMSProp-style scaling of the learning rate
separately for each parameter.
AdamW
https://arxiv.org/pdf/1711.05101.pdf
[Decoupled Weight Decay Regularization (2019)]
extends Adam by decoupling weight decay from the gradient-based update, instead of folding it in as L2 regularization.
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$
$$\theta_t = \theta_{t-1} - \eta_t \left( \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right)$$
Adam also performs bias correction to discount the early updates, when the moving averages are still biased towards their zero initialization.
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
Here $g_t$ is
the gradient, $m_t$ is
the momentum term, and $v_t$ is
the RMSProp term.
$\alpha$ is
the learning rate, $\eta_t$ is
the learning rate schedule multiplier, which can be set to 1 if no scheduler is used, and $\lambda$ is
the regularization (weight decay) parameter.
Typical values for these parameters are
$$\beta_1 = 0.9, \quad \beta_2 = 0.999, \quad \epsilon = 10^{-8}, \quad \alpha = 10^{-3}, \quad \lambda = 10^{-2}$$
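To make the update concrete, here is a minimal PyTorch sketch of one AdamW step written directly from the equations above. The function name, argument layout, and defaults are illustrative (the defaults simply mirror the typical values listed), not a library API.

```python
import torch

def adamw_step(theta, grad, m, v, t,
               alpha=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2, lr_schedule=1.0):
    """One plain AdamW step, mirroring the equations above."""
    m.mul_(beta1).add_(grad, alpha=1 - beta1)             # m_t
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)   # v_t
    m_hat = m / (1 - beta1 ** t)                          # bias-corrected momentum
    v_hat = v / (1 - beta2 ** t)                          # bias-corrected RMSProp term
    # Decoupled weight decay and the Adam step, both scaled by the schedule eta_t
    theta.add_(-lr_schedule * (alpha * m_hat / (v_hat.sqrt() + eps)
                               + weight_decay * theta))
    return theta

# Toy usage
theta, m, v = torch.zeros(4), torch.zeros(4), torch.zeros(4)
theta = adamw_step(theta, torch.randn(4), m, v, t=1)
```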
Notice
how we can simplify the update equation by bringing the bias-correction divisors out into a single per-step scalar step size:
$$\alpha_t = \alpha\,\frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t}, \qquad \theta_t = \theta_{t-1} - \eta_t \left( \frac{\alpha_t\, m_t}{\sqrt{v_t} + \epsilon} + \lambda\, \theta_{t-1} \right)$$
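As a small illustrative script (using the typical defaults from above), this is what the folded scalar looks like: one number per step replaces the two tensor-wide bias corrections.

```python
def folded_step_size(alpha, beta1, beta2, t):
    # Per-step scalar that absorbs both bias corrections
    return alpha * (1 - beta2 ** t) ** 0.5 / (1 - beta1 ** t)

# The scalar rescales the early steps and approaches alpha as t grows
for t in (1, 10, 100, 1000, 10000):
    print(t, folded_step_size(alpha=1e-3, beta1=0.9, beta2=0.999, t=t))
```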
Likewise, to take advantage of the fast reciprocal square root instruction (rsqrt),
we need to fuse the epsilon term into the square root.
Notice
how we first prove a triangle-inequality-style bound (the square root is subadditive):
$$(\sqrt{v_t} + \epsilon)^2 = v_t + 2\epsilon\sqrt{v_t} + \epsilon^2 \ge v_t + \epsilon^2 \;\;\Longrightarrow\;\; \frac{1}{\sqrt{v_t + \epsilon^2}} \ge \frac{1}{\sqrt{v_t} + \epsilon}$$
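A quick numeric check of the fused divisor against the original one (plain PyTorch; the sample values of $v_t$ are arbitrary):

```python
import torch

v = torch.tensor([1e-8, 1e-4, 1e-2, 1.0])
eps = 1e-7                                  # roughly float32 machine resolution

unfused = 1.0 / (v.sqrt() + eps)            # original divisor
fused = torch.rsqrt(v + eps * eps)          # single rsqrt with epsilon fused in

rel_err = ((fused - unfused).abs() / unfused).max()
print(rel_err)                              # on the order of eps / sqrt(v_min)
print(bool((fused >= unfused).all()))       # fused form never undershoots, per the bound
```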
So,
by storing $m_t$ and $v_t$ in memory as the optimizer state, we have that
$$\theta_t = \theta_{t-1} - \eta_t \left( \alpha_t\, m_t \cdot \mathrm{rsqrt}\!\left(v_t + \epsilon^2\right) + \lambda\, \theta_{t-1} \right)$$
Also, notice the exponent $t$ for
bias correction starts from 1 and not 0!
Likewise, notice $\mathrm{rsqrt}(v_t + \epsilon^2)$ is
essentially just $\frac{1}{\sqrt{v_t} + \epsilon}$, since
a small number squared is even smaller.
Keep $\epsilon$ at
machine resolution, i.e. 1e-7 or so for float32.
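For reference, a small check of the float32 resolution, plus an illustrative edge case (not from the text above) showing that when $v_t$ collapses to zero, epsilon alone caps the size of the update:

```python
import torch

print(torch.finfo(torch.float32).eps)      # ~1.19e-07: float32 machine resolution

# If v_t is zero, the fused divisor reduces to 1/eps, which caps the step
for eps in (1e-8, 1e-7):
    step_cap = torch.rsqrt(torch.tensor(0.0) + eps * eps)
    print(eps, step_cap.item())            # 1/eps: 1e8 vs 1e7
```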
So, by using batch gradient
descent to compute $g_t$, we have the final algorithm:
$$g_t = \nabla_\theta L(\theta_{t-1}) \;\;\text{on the current batch}$$
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$
$$\alpha_t = \alpha\,\frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t}$$
$$\theta_t = \theta_{t-1} - \eta_t \left( \alpha_t\, m_t \cdot \mathrm{rsqrt}\!\left(v_t + \epsilon^2\right) + \lambda\, \theta_{t-1} \right)$$
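Here is a minimal PyTorch sketch of this final algorithm, assuming the reconstruction above; the function and state layout are illustrative only, and a real fused implementation would live in a dedicated GPU kernel rather than eager PyTorch ops.

```python
import torch

def fused_adamw_step(theta, grad, m, v, t,
                     alpha=1e-3, beta1=0.9, beta2=0.999,
                     eps=1e-7, weight_decay=1e-2, lr_schedule=1.0):
    """One AdamW step with bias correction folded into a scalar step size
    and epsilon fused into a single rsqrt."""
    # Update the stored optimizer state (m_t, v_t) in place
    m.mul_(beta1).add_(grad, alpha=1 - beta1)             # m_t
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)   # v_t
    # Scalar step size: bias correction pulled out of the tensors
    alpha_t = alpha * (1 - beta2 ** t) ** 0.5 / (1 - beta1 ** t)
    # Fused update: one rsqrt, decoupled weight decay, schedule multiplier
    update = alpha_t * m * torch.rsqrt(v + eps * eps) + weight_decay * theta
    theta.add_(update, alpha=-lr_schedule)
    return theta

# Toy usage: a few steps on random gradients
theta = torch.zeros(4)
m, v = torch.zeros(4), torch.zeros(4)
for t in range(1, 4):          # t starts at 1, not 0
    theta = fused_adamw_step(theta, torch.randn(4), m, v, t)
print(theta)
```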
Copyright Daniel Han 2024. Check out Unsloth!