Add decoupled weight decay factory to create optimizerWs. #23158

Conversation
… Instantiates AdamW and SGDW.
vincentqb left a comment
Thanks for looking into this!
> As discussed in #22343, this pull request adds a factory function to create weight decay optimizers (Loshchilov & Hutter), reducing code duplication and allowing arbitrary optimizers to be extended with decoupled weight decay.

The duplication of code between SGD/W and Adam/W is not necessarily bad -- the versions are very easy to compare with a simple diff. However, I do agree that if more and more optimizers want decoupled weight decay, it would not be sustainable. We also want a standard way of introducing weight (or other parameter) schedulers (and I'll link back here when the schedulers' design is discussed).

> Includes AdamW and SGDW by default.
> Note: This implementation does not multiply by the learning rate and is therefore blocked by #22343.

We will need to make sure that the current AdamW from #21250 and this new version agree under reasonable conditions, so we can supersede the current version with this one.

> Note: The current implementation only works for optimizers that use the gradient but not the value of the parameter being optimized, since the weight decay is applied before the optimizer update. If desired, I can extend the factory with a boolean optimizer_requires_value, which forces the decay of the variable to be the last step (requiring a copy of the parameter).

I would indeed expect a weight scheduler to be applied at the end. Which step requires the copy of the parameter when the weight decay is moved to the end?
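For readers following the thread, here is a rough sketch of the kind of factory being discussed; the function name `with_decoupled_weight_decay` and the exact argument handling are illustrative assumptions, not the code in this PR:

```python
import torch


def with_decoupled_weight_decay(base_optimizer_cls):
    """Illustrative factory: derive a new optimizer class that multiplies each
    parameter by (1 - decay) before running the base optimizer's update."""

    class DecoupledWeightDecay(base_optimizer_cls):
        def __init__(self, params, *args, decoupled_weight_decay=0.0, **kwargs):
            super().__init__(params, *args, **kwargs)
            # Store the decay rate per parameter group so it could later be
            # scheduled independently of the learning rate.
            for group in self.param_groups:
                group.setdefault('decoupled_weight_decay', decoupled_weight_decay)

        def step(self, closure=None):
            # Apply the decoupled decay first; the base step then consumes the
            # already-computed gradients, which the decay does not touch.
            for group in self.param_groups:
                for p in group['params']:
                    if p.grad is not None:
                        p.data.mul_(1 - group['decoupled_weight_decay'])
            return super().step(closure)

    DecoupledWeightDecay.__name__ = base_optimizer_cls.__name__ + 'W'
    return DecoupledWeightDecay


# The PR ships AdamW and SGDW by default; with a factory like the above they
# would be one-liners:
AdamW = with_decoupled_weight_decay(torch.optim.Adam)
SGDW = with_decoupled_weight_decay(torch.optim.SGD)
```

As the description notes, this pre-step decay only works when the base update reads gradients rather than parameter values.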
docs/source/optim.rst (Outdated)

        :members:
    .. autoclass:: AdamW
        :members:
    .. autoclass:: AdamW2
vincentqb:
What is AdamW2?
There is an implementation of AdamW in PyTorch since #21250. In that version, the decay is taken from theta, but the weight decay and lr schedules are still coupled. I would expect the new implementation to supersede the current one.
PhilJd:
Sorry, AdamW2 was a leftover from comparing the old optimizer with the new one.
I've now copied the same update step to the wrapper.
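For context on the coupling mentioned above, a small self-contained illustration of the two decay forms (the values are made up, and the exact update lines in either implementation may differ; this only shows the scaling):

```python
import torch

lr, weight_decay = 0.1, 0.01
p = torch.ones(3)

# Coupled decay, as in the AdamW from #21250: the decay step is scaled by the
# learning rate, so the effective decay follows the lr schedule.
coupled = p * (1 - lr * weight_decay)

# Decoupled decay, as proposed here: the rate is applied as-is and could be
# scheduled independently of lr (hence the note that this is blocked by #22343).
decoupled = p * (1 - weight_decay)

print(coupled, decoupled)  # tensor([0.9990, ...]) vs tensor([0.9900, ...])
```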
    def __init__(self, decoupled_weight_decay, *args, **kwargs):
        super(DecoupledWeightDecayOptimizer, self).__init__(*args, **kwargs)
        if self.defaults["weight_decay"] != 0:
            warnings.warn(
vincentqb:
Do people want to do both at the same time? I suspect people understand AdamW and SGDW as not having L2 anymore. If so, I'd suggest simply piggybacking on the weight_decay parameter provided in Adam. There shouldn't be confusion about which regularization someone wants when they invoke AdamW.
PhilJd:
I've changed the input argument to weight_decay. Internally, however, it still needs to be called differently; otherwise the optimizers' update steps would apply an additional L2 decay.
PhilJd:
On second thought, if general schedulers are on their way, it might be confusing to have to schedule "decoupled_weight_decay" while the input argument is called "weight_decay"?
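A sketch of how that argument mapping could look at the constructor level (illustrative only; the class name and wiring are assumptions, and the decay itself would be applied in step(), as in the parameter loop shown in the diff below):

```python
import torch


class SGDW(torch.optim.SGD):
    """Sketch: the public argument is named weight_decay, but it is stored
    under a distinct internal key so the base SGD update does not also apply
    its own (coupled, L2-style) weight decay."""

    def __init__(self, params, lr, weight_decay=0.0, **kwargs):
        # The base optimizer keeps its own weight_decay at its default of 0.
        super().__init__(params, lr, **kwargs)
        for group in self.param_groups:
            group['decoupled_weight_decay'] = weight_decay
```

Whether a future scheduler should then target group['decoupled_weight_decay'] or the public name is exactly the open question raised above.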
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is not None:
                p.data.mul_(1 - group['decoupled_weight_decay'])
vincentqb:
We'll need tests here showing under which conditions this implementation and the current one agree.
PhilJd:
I haven't included tests yet; what format do you suggest? Keep the old optimizer and compare the outputs?
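One possible shape for such a test (a sketch, not the test that ended up in the PR): for plain SGD without momentum, the L2 formulation and the decoupled formulation coincide when the decoupled rate is chosen as lr * weight_decay, so the two should produce the same parameters up to floating point rounding.

```python
import torch


def test_decoupled_decay_matches_l2_for_plain_sgd():
    torch.manual_seed(0)
    lr, l2 = 0.1, 0.05

    w_ref = torch.randn(4, requires_grad=True)            # L2 via torch.optim.SGD
    w_new = w_ref.detach().clone().requires_grad_(True)   # hand-written decoupled decay
    opt_ref = torch.optim.SGD([w_ref], lr=lr, weight_decay=l2)

    for _ in range(5):
        for w in (w_ref, w_new):
            if w.grad is not None:
                w.grad.zero_()
            (w ** 2).sum().backward()      # any differentiable loss works here
        opt_ref.step()                     # p <- p - lr * (grad + l2 * p)
        with torch.no_grad():
            w_new.mul_(1 - lr * l2)        # decoupled decay with rate lr * l2 ...
            w_new -= lr * w_new.grad       # ... followed by a plain SGD step

    assert torch.allclose(w_ref, w_new, atol=1e-6)
```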
…uement with weight_decay. Pre-evaluate closures.
PhilJd:
Thanks for the review :)
Optimizers that rely on the value of the parameter (or on closures) will still need a custom implementation of decoupled weight decay. This allows the copy to be done within the loop over the parameters, copying only one parameter at a time, instead of copying all parameters in a wrapper.
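A sketch of what such a custom implementation could look like (the class name and the plain-SGD inner update are stand-ins, not the PR's code; the point is only that the pre-update value is copied one parameter at a time, inside the loop, and the decay is applied after the update):

```python
import torch


class ValueAwareSGDW(torch.optim.SGD):
    """Illustrative sketch for optimizers whose update reads the parameter
    value: decay is applied after the update, based on a copy of the
    pre-update value of the current parameter only."""

    def __init__(self, params, lr, weight_decay=0.0, **kwargs):
        super().__init__(params, lr, **kwargs)
        for group in self.param_groups:
            group['decoupled_weight_decay'] = weight_decay

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()  # pre-evaluate the closure, as in the commit above
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                old_p = p.data.clone()                 # copy only this parameter
                # A plain SGD step stands in for an update that may read p.data.
                p.data.add_(-group['lr'] * p.grad.data)
                # Decay relative to the pre-update value, applied last.
                p.data.add_(-group['decoupled_weight_decay'] * old_p)
        return loss
```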
Hi @PhilJd! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
PhilJd:
Ping @vincentqb :)