Conversation

@PhilJd PhilJd commented Jul 22, 2019

As discussed in #22343, this pull request adds a factory function to create weight decay optimizers (Loshchilov & Hutter), reducing code duplication and allowing arbitrary optimizers to be extended with decoupled weight decay.

Includes AdamW and SGDW by default.

Note: This implementation does not multiply the decay by the learning rate and is therefore blocked by #22343.

Note: The current implementation only works for optimizers whose update uses the gradient but not the value of the parameter being optimized, because the weight decay is applied before the optimizer update. If desired, I can extend the factory with a boolean optimizer_requires_value, which forces the decay of the variable to be applied as the last step (requiring a copy of the parameter).
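For illustration, a rough sketch of what such a factory might look like (the names and details here are illustrative, not the exact code in this PR):

import torch


def extend_with_decoupled_weight_decay(base_optimizer_cls):
    """Return a subclass of base_optimizer_cls that applies decoupled weight
    decay (Loshchilov & Hutter) before the base optimizer's update."""

    class DecoupledWeightDecayOptimizer(base_optimizer_cls):
        def __init__(self, params, weight_decay=0, **kwargs):
            # Keep the base optimizer's own (coupled, L2-style) weight_decay
            # at its default of 0 and store the decoupled factor per group.
            super(DecoupledWeightDecayOptimizer, self).__init__(params, **kwargs)
            for group in self.param_groups:
                group['decoupled_weight_decay'] = weight_decay

        def step(self, closure=None):
            # Decay first: only valid for optimizers whose update reads
            # p.grad but not the current value of p (see the note above).
            for group in self.param_groups:
                for p in group['params']:
                    if p.grad is not None:
                        p.data.mul_(1 - group['decoupled_weight_decay'])
            return super(DecoupledWeightDecayOptimizer, self).step(closure)

    return DecoupledWeightDecayOptimizer


SGDW = extend_with_decoupled_weight_decay(torch.optim.SGD)
AdamW = extend_with_decoupled_weight_decay(torch.optim.Adam)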

Ping @vincentqb :)

@pytorchbot added the module: docs (Related to our documentation, both in docs/ and docblocks) and module: optimizer (Related to torch.optim) labels on Jul 22, 2019
@gchanan added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Jul 23, 2019

@vincentqb vincentqb left a comment

Thanks for looking into this!

> As discussed in #22343, this pull request adds a factory function to create weight decay optimizers (Loshchilov & Hutter), reducing code duplication and allowing arbitrary optimizers to be extended with decoupled weight decay.

The duplication of code between SGD/SGDW and Adam/AdamW is not necessarily bad; the versions are very easy to compare with a simple diff. However, I agree that if more and more optimizers want decoupled weight decay, it would not be sustainable. We also want a standard way of introducing schedulers for the weight decay (and other parameters), and I'll link back here as a reference for the schedulers' design.

> Includes AdamW and SGDW by default.

> Note: This implementation does not multiply the decay by the learning rate and is therefore blocked by #22343.

We will need to make sure the current AdamW from #21250 and this new version agree under reasonable conditions, so we can supersede the current version with this one.

> Note: The current implementation only works for optimizers whose update uses the gradient but not the value of the parameter being optimized, because the weight decay is applied before the optimizer update. If desired, I can extend the factory with a boolean optimizer_requires_value, which forces the decay of the variable to be applied as the last step (requiring a copy of the parameter).

I would indeed expect a weight decay scheduler to be applied at the end. Which step makes you require a copy of the parameter when moving the weight decay to the end?

    :members:
.. autoclass:: AdamW
    :members:
.. autoclass:: AdamW2

Contributor

What is AdamW2?

There has been an implementation of AdamW in PyTorch since #21250. In that version, the decay is taken from theta, but the weight decay and lr schedules are still coupled. I would expect the new implementation to supersede the current one.
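For concreteness, the two decay terms can be sketched roughly like this (illustrative only; the gradient-based part of the update is omitted):

import torch

lr, weight_decay = 0.1, 0.01
p_coupled = torch.ones(3)
p_decoupled = torch.ones(3)

# Current AdamW (#21250): the decay term is scaled by lr, so it still
# follows the learning-rate schedule (coupled to lr).
p_coupled.mul_(1 - lr * weight_decay)

# This PR's variant: the decay is applied independently of lr
# (decoupled), which is why it interacts with #22343.
p_decoupled.mul_(1 - weight_decay)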

Author

Sorry, AdamW2 was a relic from comparing the old optimizer with the new one.
I've now copied the same update step to the wrapper.

def __init__(self, decoupled_weight_decay, *args, **kwargs):
    super(DecoupledWeightDecayOptimizer, self).__init__(*args, **kwargs)
    if self.defaults["weight_decay"] != 0:
        warnings.warn(

Contributor

Do people want to do both at the same time? I suspect people understand AdamW and SGDW as no longer having L2 regularization. If so, I'd suggest simply piggybacking on the weight_decay parameter already provided in Adam. There shouldn't be any confusion about which regularization someone wants when they invoke AdamW.

Author

I've changed the input argument to weight_decay. Internally, however, it still needs to be named differently; otherwise the optimizers' update steps would apply additional L2 decay.
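To illustrate why the internal name has to differ: a nonzero weight_decay on the base Adam is folded into the gradient as an L2 term, so it would stack on top of the decoupled decay if the wrapper reused the same key. A small check with stock torch.optim.Adam (the scenario is illustrative):

import torch

p_plain = torch.nn.Parameter(torch.ones(3))
p_l2 = torch.nn.Parameter(torch.ones(3))

opt_plain = torch.optim.Adam([p_plain], lr=0.1, weight_decay=0.0)
opt_l2 = torch.optim.Adam([p_l2], lr=0.1, weight_decay=0.5)

for p, opt in ((p_plain, opt_plain), (p_l2, opt_l2)):
    p.grad = torch.zeros(3)  # no loss gradient at all
    opt.step()

print(p_plain)  # unchanged: zero gradient and no decay
print(p_l2)     # moved: Adam's weight_decay turned the zero gradient into an L2 term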

Author

On second thought, if general schedulers are on their way, it might be confusing to have to schedule "decoupled_weight_decay" while the input argument is called "weight_decay"?

for group in self.param_groups:
    for p in group['params']:
        if p.grad is not None:
            p.data.mul_(1 - group['decoupled_weight_decay'])

Contributor

We'll need tests here showing under which conditions this implementation and the current one agree.

Author

I haven't included this yet; what format do you suggest? Keep the old optimizer and compare the outputs?
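One possible shape for such a test, as a rough sketch: run the existing torch.optim.AdamW and a wrapped Adam side by side and compare the parameters. Since the existing AdamW scales its decay by lr while this version does not, the decay argument of the wrapped optimizer is rescaled by lr here so the trajectories can match; extend_with_decoupled_weight_decay stands in for whatever the final factory is called.

import torch


def test_wrapped_adam_matches_existing_adamw():
    lr, wd = 1e-3, 1e-2
    torch.manual_seed(0)
    p_ref = torch.randn(10, requires_grad=True)
    p_new = p_ref.detach().clone().requires_grad_(True)

    opt_ref = torch.optim.AdamW([p_ref], lr=lr, weight_decay=wd)
    # Hypothetical factory from this PR; the existing AdamW couples the decay
    # to lr, so the wrapped version needs weight_decay = lr * wd to match.
    AdamWrapped = extend_with_decoupled_weight_decay(torch.optim.Adam)
    opt_new = AdamWrapped([p_new], lr=lr, weight_decay=lr * wd)

    for _ in range(100):
        for p, opt in ((p_ref, opt_ref), (p_new, opt_new)):
            opt.zero_grad()
            (p ** 2).sum().backward()
            opt.step()

    assert torch.allclose(p_ref, p_new)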

Author

PhilJd commented Aug 5, 2019

Thanks for the review :)
I'm decaying first, as the update step usually only requires the gradient (so changing p beforehand has no effect on the step), while the decay depends on the value of the parameter (so changing p beforehand does have an effect).

  • original formula (with p = 2, lr * p.grad = 1, wd = 0.1):
    p = p - (lr * p.grad) - (p * wd) # p -> 0.8
  • decay after update uses the updated p for the decay:
    opt.step() # p -> 1
    decay(p) # p -> 0.9
  • decay before update uses the original value of p and reproduces the formula:
    decay(p) # p -> 1.8
    opt.step() # p -> 0.8

Optimizers that rely on the value of the parameter (or on closures) will still need a custom implementation of decoupled weight decay. That way the copy can be made inside the loop over the parameters, copying only one parameter at a time, instead of copying all parameters at once in a wrapper.
However, the most commonly used optimizers only use the gradient, so this wrapper is still reasonable ;)
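A self-contained sketch of the ordering argument above, using plain SGD so the numbers are easy to follow (values mirror the example: p = 2, lr * p.grad = 1, wd = 0.1):

import torch


def run(decay_first):
    p = torch.tensor([2.0], requires_grad=True)
    opt = torch.optim.SGD([p], lr=1.0)
    p.grad = torch.tensor([1.0])  # so lr * p.grad = 1
    wd = 0.1
    if decay_first:
        p.data.mul_(1 - wd)       # p -> 1.8
    opt.step()                    # subtracts lr * p.grad = 1
    if not decay_first:
        p.data.mul_(1 - wd)
    return p.item()


print(run(decay_first=True))   # ~0.8, matches p - lr * p.grad - wd * p
print(run(decay_first=False))  # ~0.9, decays the already-updated value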

@PhilJd changed the title from "Add decoupled weight decay factory to remove code duplication." to "Add decoupled weight decay factory to create optimizerWs." on Aug 5, 2019
@facebook-github-bot

Hi @PhilJd!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@PhilJd PhilJd closed this May 11, 2022