
Adam not optimally implemented: unnecessary torch.div #79352

@jxtps

Description


🚀 The feature, motivation and pitch

Adam is implemented in https://github.com/pytorch/pytorch/blob/master/torch/optim/adam.py#L259 using the following effective formula (terminology from the paper and the code with light abbreviations):

theta = lr / bias1 * m / (sqrt(v) / bias2 + eps)
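For concreteness, here is a minimal, self-contained sketch of how the single-tensor update in the referenced file computes this (paraphrased; the function name and signature are mine, not PyTorch API):

```python
import math
import torch

def adam_step_current(param, grad, exp_avg, exp_avg_sq, step, lr=1e-3,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step, mirroring the denominator computation in torch/optim/adam.py."""
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step

    # Extra full-tensor op: sqrt(v) is divided elementwise by the scalar sqrt(bias_correction2)
    denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(eps)
    step_size = lr / bias_correction1
    param.addcdiv_(exp_avg, denom, value=-step_size)
```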

This formulation requires an additional division of the full-sized tensor sqrt(v) by bias2. Presumably it was done this way so that eps is not implicitly scaled by bias2. However, we can achieve the same thing without the costly sqrt(v) / bias2 op: multiply and divide eps by bias2, then factor 1/bias2 out of the denominator, which gives:

theta = lr * bias2 / bias1 * m / (sqrt(v) + bias2 * eps)

which should be incrementally faster to compute, since lr * bias2 / bias1 and bias2 * eps are scalars computed once per step, whereas sqrt(v) / bias2 divides a full-sized tensor elementwise.
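A sketch of the factored form, plus a quick numerical check against the sketch above (adam_step_current / adam_step_factored are hypothetical helper names for illustration, not PyTorch API):

```python
import math
import torch

def adam_step_factored(param, grad, exp_avg, exp_avg_sq, step, lr=1e-3,
                       beta1=0.9, beta2=0.999, eps=1e-8):
    """Same update, with 1/bias2 factored out of the denominator."""
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    bias_correction1 = 1 - beta1 ** step
    bias_correction2_sqrt = math.sqrt(1 - beta2 ** step)

    # sqrt(v) is used directly; only the scalar eps gets rescaled
    denom = exp_avg_sq.sqrt().add_(bias_correction2_sqrt * eps)
    step_size = lr * bias_correction2_sqrt / bias_correction1
    param.addcdiv_(exp_avg, denom, value=-step_size)

# Quick equivalence check against adam_step_current from the sketch above
torch.manual_seed(0)
p1 = torch.randn(1000)
p2 = p1.clone()
g = torch.randn(1000)
m1, v1 = torch.zeros(1000), torch.zeros(1000)
m2, v2 = torch.zeros(1000), torch.zeros(1000)
adam_step_current(p1, g, m1, v1, step=1)
adam_step_factored(p2, g, m2, v2, step=1)
print(torch.allclose(p1, p2))  # expected: True, up to floating-point rounding
```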

Note: the current formulation permeates the Adam and Adam-esque implementations, so any update should presumably be propagated across them.

Alternatives

N/a

Additional context

I've confirmed that the current formulation does indeed induce an extra div op in the Adam step:

[screenshot: profiler trace showing the extra torch.div op in the Adam step]

This was done on PyTorch v1.10, and the code has been reorganized since, so the precise file and line reference is no longer accurate. The relevant piece of code hasn't actually changed, though; it's still there at https://github.com/pytorch/pytorch/blob/master/torch/optim/adam.py#L259 as of this writing.
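For reference, one way to reproduce this kind of measurement with torch.profiler; this is a sketch, not how the screenshot above was produced, and the model size and step count are arbitrary:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 4096)

# Warm-up step so the optimizer state (exp_avg, exp_avg_sq) exists before profiling
model(x).sum().backward()
opt.step()

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(10):
        opt.zero_grad()
        model(x).sum().backward()
        opt.step()

# Look for aten::div entries from opt.step() alongside aten::sqrt / aten::addcdiv
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=25))
```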

cc @VitalyFedyunin @ngimel @vincentqb @jbschlosser @albanD


Labels: actionable, module: optimizer, module: performance, triaged
