Open
Labels
module: optimizer (Related to torch.optim), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Description
🐛 Bug
As the original paper (https://arxiv.org/pdf/1711.05101.pdf, green boxes) shows,
the formula for applying weight decay to Adam should be
\theta_t = (1 - \lambda) * \theta_{t-1}, or, with a schedule multiplier \eta_t,
\theta_t = (1 - \eta_t * \lambda) * \theta_{t-1}.
However, the AdamW implementation on the master branch (https://github.com/pytorch/pytorch/blob/master/torch/optim/adamw.py#L73) couples weight decay with the learning rate:
\theta_t = (1 - \lambda * {learning rate}) * \theta_{t-1}
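To make the difference concrete, here is a minimal numeric sketch (not the optimizer code itself); `lr`, `weight_decay`, and `schedule_mult` are illustrative values:

```python
import torch

# Minimal numeric sketch (not the actual torch.optim.AdamW code) contrasting the
# two decay factors discussed above; lr, weight_decay, and schedule_mult are
# illustrative values.
p = torch.ones(3)                 # a parameter tensor
lr, weight_decay = 1e-3, 1e-2
schedule_mult = 1.0               # the schedule multiplier eta_t from the paper

# Paper (decoupled from lr): theta_t = (1 - eta_t * lambda) * theta_{t-1}
p_paper = p * (1 - schedule_mult * weight_decay)

# Master branch (coupled to lr): theta_t = (1 - lr * lambda) * theta_{t-1}
p_master = p * (1 - lr * weight_decay)

print(p_paper)    # shrunk by a factor of 0.99
print(p_master)   # shrunk by a factor of only 0.99999
```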
Expected behavior
I think this line (line 73 in a8d8fc5):
`p.mul_(1 - group['lr'] * group['weight_decay'])`
should be changed to:
`p.mul_(1 - group['weight_decay'])`
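As a rough way to observe the current behavior end-to-end (assuming the master-branch step described above): with a zero gradient, the Adam update term is zero, so only the decay multiplication moves the weight.

```python
import torch

# With a zero gradient, the Adam moment estimates stay at zero, so the only
# change to the weight on step() is the weight-decay multiplication.
w = torch.nn.Parameter(torch.ones(1))
opt = torch.optim.AdamW([w], lr=1e-3, weight_decay=1e-2)

w.grad = torch.zeros_like(w)
opt.step()

# On the current master branch this prints ~0.99999 (factor 1 - lr * weight_decay)
# rather than 0.99 (factor 1 - weight_decay).
print(w.item())
```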
cc @vincentqb