Open
Labels
module: optimizer (Related to torch.optim), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Description
🐛 Bug
As the original paper (https://arxiv.org/pdf/1711.05101.pdf, green boxes) shows,
the formula for applying weight decay to Adam should be
\theta_t = (1 - \lambda) * \theta_{t-1}, or, with a schedule multiplier \eta_t,
\theta_t = (1 - \eta_t * \lambda) * \theta_{t-1}.
However, the AdamW implementation on the master branch (https://github.com/pytorch/pytorch/blob/master/torch/optim/adamw.py#L73) couples weight decay with the learning rate:
\theta_t = (1 - \lambda * {learning rate}) * \theta_{t-1}
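To make the difference concrete, here is a minimal numeric sketch (not the optimizer code itself); `lr`, `weight_decay`, and `schedule_mult` are illustrative values:

```python
import torch

# Minimal numeric sketch (not the actual torch.optim.AdamW code) contrasting the
# two decay factors discussed above; lr, weight_decay, and schedule_mult are
# illustrative values.
p = torch.ones(3)                 # a parameter tensor
lr, weight_decay = 1e-3, 1e-2
schedule_mult = 1.0               # the schedule multiplier eta_t from the paper

# Paper (decoupled from lr): theta_t = (1 - eta_t * lambda) * theta_{t-1}
p_paper = p * (1 - schedule_mult * weight_decay)

# Master branch (coupled to lr): theta_t = (1 - lr * lambda) * theta_{t-1}
p_master = p * (1 - lr * weight_decay)

print(p_paper)    # shrunk by a factor of 0.99
print(p_master)   # shrunk by a factor of only 0.99999
```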
Expected behavior
I think this line (line 73 in a8d8fc5):
`p.mul_(1 - group['lr'] * group['weight_decay'])`
should be changed to:
`p.mul_(1 - group['weight_decay'])`
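As a rough way to observe the current behavior end-to-end (assuming the master-branch step described above): with a zero gradient, the Adam update term is zero, so only the decay multiplication moves the weight.

```python
import torch

# With a zero gradient, the Adam moment estimates stay at zero, so the only
# change to the weight on step() is the weight-decay multiplication.
w = torch.nn.Parameter(torch.ones(1))
opt = torch.optim.AdamW([w], lr=1e-3, weight_decay=1e-2)

w.grad = torch.zeros_like(w)
opt.step()

# On the current master branch this prints ~0.99999 (factor 1 - lr * weight_decay)
# rather than 0.99 (factor 1 - weight_decay).
print(w.item())
```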
cc @vincentqb