🚀 The feature, motivation and pitch
Adam is implemented in https://github.com/pytorch/pytorch/blob/master/torch/optim/adam.py#L259 using the following effective formula (terminology from the paper and the code with light abbreviations):
theta -= lr / bias1 * m / (sqrt(v) / bias2 + eps)
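For concreteness, here is a simplified sketch of a single-tensor Adam step with this structure (not a verbatim copy of torch/optim/adam.py; the formula's bias2 corresponds to sqrt(1 - beta2 ** step) below):

```python
import math
import torch

def adam_step_current(param, grad, exp_avg, exp_avg_sq, step,
                      lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # All tensor arguments are torch.Tensors of the same shape.
    # Update biased first and second moment estimates in place.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step

    # The extra full-tensor op: sqrt(v) is divided by the scalar
    # sqrt(bias_correction2) before eps is added.
    denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(eps)
    param.addcdiv_(exp_avg, denom, value=-lr / bias_correction1)
```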
This formulation requires an additional division of the full-sized tensor sqrt(v) by bias2. Presumably it was done this way so that eps is not implicitly scaled by bias2. However, we can still keep eps independent of bias2 without the costly sqrt(v) / bias2 op: multiply and divide eps by bias2, then factor 1/bias2 out of the denominator, and you get:
theta -= lr * bias2 / bias1 * m / (sqrt(v) + bias2 * eps)
which should be incrementally faster to compute.
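A sketch of the reformulated step, plus a quick numerical check that it matches the current one (reusing adam_step_current from the sketch above; shapes and tolerance are arbitrary):

```python
def adam_step_proposed(param, grad, exp_avg, exp_avg_sq, step,
                       lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Same moment updates as in adam_step_current.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    bias1 = 1 - beta1 ** step
    bias2 = math.sqrt(1 - beta2 ** step)

    # eps is pre-scaled by the scalar bias2, so no full-tensor division is needed.
    denom = exp_avg_sq.sqrt().add_(bias2 * eps)
    param.addcdiv_(exp_avg, denom, value=-lr * bias2 / bias1)

# Both variants should produce the same parameters up to floating-point noise.
torch.manual_seed(0)
p1 = torch.randn(1000); p2 = p1.clone()
g = torch.randn(1000)
m1, v1, m2, v2 = (torch.zeros(1000) for _ in range(4))
adam_step_current(p1, g, m1, v1, step=1)
adam_step_proposed(p2, g, m2, v2, step=1)
print(torch.allclose(p1, p2, atol=1e-7))  # True
```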
Note: the current formulation permeates the Adam and Adam-esque implementations, so any update should presumably be propagated across them.
Alternatives
N/a
Additional context
I've confirmed that the current formulation does indeed induce an extra div op in the Adam step. This check was done on PyTorch v1.10, and the code has been reorganized since, so the precise file and line reference is no longer accurate, but the relevant piece of code hasn't actually changed: it's still there in https://github.com/pytorch/pytorch/blob/master/torch/optim/adam.py#L259 as of this writing.
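One way to reproduce this kind of check on a current build is to count div kernels issued during a profiled optimizer.step(); a minimal sketch (model size and shapes are arbitrary, and exact kernel names vary by version and backend):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024)
opt = torch.optim.Adam(model.parameters())
model(torch.randn(64, 1024)).sum().backward()
opt.step()  # warm-up so the optimizer state (exp_avg, exp_avg_sq) is initialized

with profile(activities=[ProfilerActivity.CPU]) as prof:
    opt.step()

# Look for the div calls issued during the step.
for evt in prof.key_averages():
    if "div" in evt.key:
        print(evt.key, evt.count)
```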
