
Weight decay in AdamW #38853

@jeongukjae

Description

🐛 Bug

As the original paper (https://arxiv.org/pdf/1711.05101.pdf, green boxes) shows,

[Screenshot: the decoupled weight-decay steps highlighted in green in the paper]

the weight-decay part of the Adam update should be

\theta_t = (1 - \lambda) \theta_{t-1}

or, with a schedule multiplier \eta_t,

\theta_t = (1 - \eta_t \lambda) \theta_{t-1}.

However, the AdamW implementation on the master branch (https://github.com/pytorch/pytorch/blob/master/torch/optim/adamw.py#L73) scales the weight decay by the learning rate:

\theta_t = (1 - \text{lr} \cdot \lambda) \theta_{t-1}
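
For concreteness, here is a minimal Python sketch of how much each formula shrinks a parameter per step. The values of lr and weight_decay below are illustrative assumptions, not taken from the paper:

# Decay factors implied by the two formulas above.
# lr and weight_decay are illustrative values (assumptions).
lr = 1e-3
weight_decay = 1e-2

factor_with_lr = 1 - lr * weight_decay   # current master branch: 0.99999
factor_without_lr = 1 - weight_decay     # formula quoted from the paper: 0.99

print(factor_with_lr, factor_without_lr)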

Expected behavior

I think this line:

p.mul_(1 - group['lr'] * group['weight_decay'])

should be p.mul_(1 - group['weight_decay']).
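
As a rough standalone sketch of the proposed change (the group dict below is only a stand-in for an AdamW param group with illustrative hyperparameter values, not the actual optimizer code):

import torch

# Stand-in for an AdamW param group; values are illustrative assumptions.
group = {'lr': 1e-3, 'weight_decay': 1e-2}
p = torch.ones(3)

# Current behavior on master: decay scaled by the learning rate.
p_current = p.clone()
p_current.mul_(1 - group['lr'] * group['weight_decay'])

# Behavior proposed in this issue: decay applied directly.
p_proposed = p.clone()
p_proposed.mul_(1 - group['weight_decay'])

print(p_current)   # each element shrunk by a factor of 1 - 1e-5
print(p_proposed)  # tensor([0.9900, 0.9900, 0.9900])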

cc @vincentqb

Metadata

    Labels

    module: optimizer — Related to torch.optim
    triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
