Conversation

@PhilJd PhilJd commented Jul 22, 2019

As discussed in #22343, this pull request adds a factory function to create weight decay optimizers (Loshchilov & Hutter), reducing code duplication and allowing arbitrary optimizers to be extended with decoupled weight decay.

Includes AdamW and SGDW by default.

Note: This implementation does not multiply the decay by the learning rate and is therefore blocked by #22343.

Note: The current implementation only works for optimizers whose update uses the gradient but not the value of the parameter being optimized, because the weight decay is applied before the optimizer update. If desired, I can extend the factory with a boolean optimizer_requires_value, which forces the decay of the variable to be applied as the last step (requiring a copy of the parameter).
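For illustration, a rough sketch of what such a factory might look like (the names and details here are illustrative, not the exact code in this PR):

import torch


def extend_with_decoupled_weight_decay(base_optimizer_cls):
    """Return a subclass of base_optimizer_cls that applies decoupled weight
    decay (Loshchilov & Hutter) before the base optimizer's update."""

    class DecoupledWeightDecayOptimizer(base_optimizer_cls):
        def __init__(self, params, weight_decay=0, **kwargs):
            # Keep the base optimizer's own (coupled, L2-style) weight_decay
            # at its default of 0 and store the decoupled factor per group.
            super(DecoupledWeightDecayOptimizer, self).__init__(params, **kwargs)
            for group in self.param_groups:
                group['decoupled_weight_decay'] = weight_decay

        def step(self, closure=None):
            # Decay first: only valid for optimizers whose update reads
            # p.grad but not the current value of p (see the note above).
            for group in self.param_groups:
                for p in group['params']:
                    if p.grad is not None:
                        p.data.mul_(1 - group['decoupled_weight_decay'])
            return super(DecoupledWeightDecayOptimizer, self).step(closure)

    return DecoupledWeightDecayOptimizer


SGDW = extend_with_decoupled_weight_decay(torch.optim.SGD)
AdamW = extend_with_decoupled_weight_decay(torch.optim.Adam)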

Ping @vincentqb :)

@pytorchbot added the module: docs (Related to our documentation, both in docs/ and docblocks) and module: optimizer (Related to torch.optim) labels on Jul 22, 2019
@gchanan added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Jul 23, 2019

@vincentqb vincentqb left a comment

Thanks for looking into this!

> As discussed in #22343, this pull request adds a factory function to create weight decay optimizers (Loshchilov & Hutter), reducing code duplication and allowing arbitrary optimizers to be extended with decoupled weight decay.

The duplication of code between SGD/SGDW and Adam/AdamW is not necessarily bad; the versions are very easy to compare with a simple diff. However, I agree that if more and more optimizers want decoupled weight decay, it would not be sustainable. We also want a standard way of introducing schedulers for the weight decay (and other parameters), and I'll link back here as a reference for the schedulers' design.

> Includes AdamW and SGDW by default.

> Note: This implementation does not multiply the decay by the learning rate and is therefore blocked by #22343.

We will need to make sure the current AdamW from #21250 and this new version agree under reasonable conditions, so we can supersede the current version with this one.

> Note: The current implementation only works for optimizers whose update uses the gradient but not the value of the parameter being optimized, because the weight decay is applied before the optimizer update. If desired, I can extend the factory with a boolean optimizer_requires_value, which forces the decay of the variable to be applied as the last step (requiring a copy of the parameter).

I would indeed expect a weight decay scheduler to be applied at the end. Which step makes you require a copy of the parameter when moving the weight decay to the end?

    :members:
.. autoclass:: AdamW
    :members:
.. autoclass:: AdamW2

Contributor

What is AdamW2?

There has been an implementation of AdamW in PyTorch since #21250. In that version, the decay is taken from theta, but the weight decay and lr schedules are still coupled. I would expect the new implementation to supersede the current one.
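For concreteness, the two decay terms can be sketched roughly like this (illustrative only; the gradient-based part of the update is omitted):

import torch

lr, weight_decay = 0.1, 0.01
p_coupled = torch.ones(3)
p_decoupled = torch.ones(3)

# Current AdamW (#21250): the decay term is scaled by lr, so it still
# follows the learning-rate schedule (coupled to lr).
p_coupled.mul_(1 - lr * weight_decay)

# This PR's variant: the decay is applied independently of lr
# (decoupled), which is why it interacts with #22343.
p_decoupled.mul_(1 - weight_decay)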

Author

Sorry, AdamW2 was a relic from comparing the old optimizer with the new one.
I've now copied the same update step to the wrapper.

def __init__(self, decoupled_weight_decay, *args, **kwargs):
    super(DecoupledWeightDecayOptimizer, self).__init__(*args, **kwargs)
    if self.defaults["weight_decay"] != 0:
        warnings.warn(

Contributor

Do people want to do both at the same time? I suspect people understand AdamW and SGDW as no longer having L2 regularization. If so, I'd suggest simply piggybacking on the weight_decay parameter already provided in Adam. There shouldn't be any confusion about which regularization someone wants when they invoke AdamW.

Author

I've changed the input argument to weight_decay. Internally, however, it still needs to be named differently; otherwise the optimizers' update steps would apply additional L2 decay.
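To illustrate why the internal name has to differ: a nonzero weight_decay on the base Adam is folded into the gradient as an L2 term, so it would stack on top of the decoupled decay if the wrapper reused the same key. A small check with stock torch.optim.Adam (the scenario is illustrative):

import torch

p_plain = torch.nn.Parameter(torch.ones(3))
p_l2 = torch.nn.Parameter(torch.ones(3))

opt_plain = torch.optim.Adam([p_plain], lr=0.1, weight_decay=0.0)
opt_l2 = torch.optim.Adam([p_l2], lr=0.1, weight_decay=0.5)

for p, opt in ((p_plain, opt_plain), (p_l2, opt_l2)):
    p.grad = torch.zeros(3)  # no loss gradient at all
    opt.step()

print(p_plain)  # unchanged: zero gradient and no decay
print(p_l2)     # moved: Adam's weight_decay turned the zero gradient into an L2 term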

Author

On second thought, if general schedulers are on their way, it might be confusing to have to schedule "decoupled_weight_decay" while the input argument is called "weight_decay"?

for group in self.param_groups:
    for p in group['params']:
        if p.grad is not None:
            p.data.mul_(1 - group['decoupled_weight_decay'])

Contributor

We'll need tests here showing under which conditions this implementation and the current one agree.

Author

I haven't included this yet; what format do you suggest? Keep the old optimizer and compare the outputs?
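One possible shape for such a test, as a rough sketch: run the existing torch.optim.AdamW and a wrapped Adam side by side and compare the parameters. Since the existing AdamW scales its decay by lr while this version does not, the decay argument of the wrapped optimizer is rescaled by lr here so the trajectories can match; extend_with_decoupled_weight_decay stands in for whatever the final factory is called.

import torch


def test_wrapped_adam_matches_existing_adamw():
    lr, wd = 1e-3, 1e-2
    torch.manual_seed(0)
    p_ref = torch.randn(10, requires_grad=True)
    p_new = p_ref.detach().clone().requires_grad_(True)

    opt_ref = torch.optim.AdamW([p_ref], lr=lr, weight_decay=wd)
    # Hypothetical factory from this PR; the existing AdamW couples the decay
    # to lr, so the wrapped version needs weight_decay = lr * wd to match.
    AdamWrapped = extend_with_decoupled_weight_decay(torch.optim.Adam)
    opt_new = AdamWrapped([p_new], lr=lr, weight_decay=lr * wd)

    for _ in range(100):
        for p, opt in ((p_ref, opt_ref), (p_new, opt_new)):
            opt.zero_grad()
            (p ** 2).sum().backward()
            opt.step()

    assert torch.allclose(p_ref, p_new)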

Author

PhilJd commented Aug 5, 2019

Thanks for the review :)
I'm decaying first, as the update step usually only requires the gradient (so changing p beforehand has no effect on the step), while the decay depends on the value of the parameter (so changing p beforehand does have an effect).

  • original formula (with p = 2, lr * p.grad = 1, wd = 0.1):
    p = p - (lr * p.grad) - (p * wd) # p -> 0.8
  • decay after update uses the updated p for the decay:
    opt.step() # p -> 1
    decay(p) # p -> 0.9
  • decay before update uses the original value of p and reproduces the formula:
    decay(p) # p -> 1.8
    opt.step() # p -> 0.8

Optimizers that rely on the value of the parameter (or on closures) will still need a custom implementation of decoupled weight decay. That way the copy can be made inside the loop over the parameters, copying only one parameter at a time, instead of copying all parameters at once in a wrapper.
However, the most commonly used optimizers only use the gradient, so this wrapper is still reasonable ;)
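A self-contained sketch of the ordering argument above, using plain SGD so the numbers are easy to follow (values mirror the example: p = 2, lr * p.grad = 1, wd = 0.1):

import torch


def run(decay_first):
    p = torch.tensor([2.0], requires_grad=True)
    opt = torch.optim.SGD([p], lr=1.0)
    p.grad = torch.tensor([1.0])  # so lr * p.grad = 1
    wd = 0.1
    if decay_first:
        p.data.mul_(1 - wd)       # p -> 1.8
    opt.step()                    # subtracts lr * p.grad = 1
    if not decay_first:
        p.data.mul_(1 - wd)
    return p.item()


print(run(decay_first=True))   # ~0.8, matches p - lr * p.grad - wd * p
print(run(decay_first=False))  # ~0.9, decays the already-updated value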

@PhilJd changed the title from "Add decoupled weight decay factory to remove code duplication." to "Add decoupled weight decay factory to create optimizerWs." on Aug 5, 2019
@facebook-github-bot

Hi @PhilJd!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@PhilJd PhilJd closed this May 11, 2022