Documentation for torch.optim.swa_utils
#41228
Conversation
💊 CI failures summary: as of commit c7d5bbd, there are no failures yet. (Automatically generated by Dr. CI.)
LGTM!
Let's see what this looks like here, once the PR has landed.
facebook-github-bot
left a comment
@vincentqb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@vincentqb merged this pull request in 509c18a.
cc @jlin27 to add to the release
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
>>> if i > swa_start:
>>>     swa_model.update_parameters(model)
@izmailovpavel Hi, thanks for the implementation of SWA in PyTorch! Do you find it useful to average the weights during the annealing phase? By default anneal_epochs equals 10 (and cannot be set to 0), so in this snippet the weights will be averaged during the annealing phase for the first 10 epochs after swa_start.
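For reference, here is a small sketch (not part of the PR; the toy optimizer and learning-rate values are made up) showing how SWALR anneals with its default anneal_epochs=10, i.e. the 10-epoch window being discussed:

import torch
from torch.optim.swa_utils import SWALR

# Toy optimizer, used only to inspect the schedule.
opt = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=0.5)
swa_scheduler = SWALR(opt, swa_lr=0.05)  # anneal_epochs defaults to 10

for step in range(12):
    # The lr decays from the initial 0.5 toward swa_lr over roughly the first 10 steps,
    # which is the window in which update_parameters() is already being called
    # in the documented snippet.
    print(step, opt.param_groups[0]["lr"])
    swa_scheduler.step()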
Hey Daniil, thank you for looking into the implementation! Generally, it's reasonable to average the weights during the annealing phase, but I imagine there could be cases when it's not desirable, e.g. when the learning rate before the annealing is way too high.
Thanks for the response! It is just a bit confusing. According to the paper, the purpose of SWA is to average weights during exploration of the loss surface (to find high-performing networks), but averaging during annealing would mean averaging under different (decreasing) learning rates, which the paper describes as not suitable for improving generalization (pointing to the experiments in Ruppert, 1988), since SGD does not behave very differently under such a schedule.
Whether averaging in the annealing phase is desirable depends on how much the learning rate changes during that phase. Generally, averaging at different learning rates can work well, see e.g. https://arxiv.org/abs/1806.05594, although it can also be bad, as you said.
It may be safer to fix it in the docs as you suggest.
lr_anneal_epochs = 10
swa_scheduler = SWALR(optimizer, swa_lr=0.05, anneal_epochs=lr_anneal_epochs)
...
if i > swa_start + lr_anneal_epochs:
    swa_model.update_parameters(model)
It makes the example a bit more complex... @vincentqb @andrewgordonwilson what do you think?
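For context, a runnable sketch of how that suggestion could look in a full loop (the toy model, data, and epoch counts below are made up for illustration and are not part of the actual docs):

import torch
from torch import nn
from torch.optim.swa_utils import AveragedModel, SWALR

# Toy setup so the sketch runs end to end.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
swa_model = AveragedModel(model)

lr_anneal_epochs = 10
swa_scheduler = SWALR(optimizer, swa_lr=0.05, anneal_epochs=lr_anneal_epochs)

swa_start = 160
for epoch in range(200):
    # One toy optimization step stands in for a full training epoch.
    optimizer.zero_grad()
    nn.functional.mse_loss(model(torch.randn(16, 10)), torch.randn(16, 1)).backward()
    optimizer.step()

    if epoch > swa_start:
        swa_scheduler.step()
    # Only start averaging once the SWALR annealing phase has finished.
    if epoch > swa_start + lr_anneal_epochs:
        swa_model.update_parameters(model)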
Is there any chance to remove the annealing phase from the scheduler? I believe this is the part that accidentally differs from the paper.
> It may be safer to fix it in the docs as you suggest.
> lr_anneal_epochs = 10
> swa_scheduler = SWALR(optimizer, swa_lr=0.05, anneal_epochs=lr_anneal_epochs)
> ...
> if i > swa_start + lr_anneal_epochs:
>     swa_model.update_parameters(model)
> It makes the example a bit more complex... @vincentqb @andrewgordonwilson what do you think?
What would be an alternate fix that would make this simpler?
I believe the simplest option is to allow SWALR to take anneal_epochs=0 (and make it the default), thus skipping the annealing phase. In this case this snippet will be consistent with the paper and can be left as is (but it requires modifying the SWALR code). I can make a PR if that works.
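Purely as an illustration of that proposal (SWALR rejects anneal_epochs=0 at the time of this thread, so the call below does not work yet), the documented snippet could then stay essentially unchanged:

# Hypothetical: assumes a SWALR that accepts anneal_epochs=0; this is a sketch of the
# proposal above, not currently supported behaviour.
swa_model = AveragedModel(model)
swa_scheduler = SWALR(optimizer, swa_lr=0.05, anneal_epochs=0)  # proposed: skip annealing

for epoch in range(200):
    # ... one epoch of training with `optimizer` ...
    if epoch > swa_start:
        swa_scheduler.step()                # lr would sit at swa_lr immediately
        swa_model.update_parameters(model)  # every averaged snapshot taken at swa_lr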
Sorry for not following up on this discussion for so long. @Daniil-Osokin I agree that we should allow anneal_epochs=0, and I think it would be great if you could make a pull request for that. I am not sure whether the number of annealing epochs should be zero by default; I am open to discussion. I think both options are reasonable, but I agree that it may be better to set anneal_epochs=0 by default.
This PR adds a description of torch.optim.swa_utils, added in #35032, to the docs at docs/source/optim.rst. Please let me know what you think! @vincentqb @andrewgordonwilson
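For readers arriving from the issue tracker, the usage pattern being documented looks roughly like the condensed sketch below (the toy data, model, and epoch counts are placeholders, not the exact text of the new docs):

import torch
from torch import nn
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn
from torch.optim.lr_scheduler import CosineAnnealingLR

# Toy data and model so the example is self-contained.
loader = [(torch.randn(16, 10), torch.randn(16, 1)) for _ in range(4)]
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = CosineAnnealingLR(optimizer, T_max=100)
swa_model = AveragedModel(model)
swa_scheduler = SWALR(optimizer, swa_lr=0.05)
swa_start = 75

for epoch in range(100):
    for input, target in loader:
        optimizer.zero_grad()
        loss_fn(model(input), target).backward()
        optimizer.step()
    if epoch > swa_start:
        swa_model.update_parameters(model)
        swa_scheduler.step()
    else:
        scheduler.step()

# Recompute batch-norm statistics for the averaged model
# (a no-op here, since nn.Linear has no batch norm layers).
update_bn(loader, swa_model)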