Fix a GroupNorm cuda bug when input does not require_grad #44863
Conversation
cc @ptrblck
xiaomengy left a comment:
Thanks for fixing this.
int64_t group,
const int64_t C,
const int64_t HxW,
const int64_t group,
Is this const required?
They are not strictly required. I assume setting them to const may enable better compiler optimizations.
I don't think this makes a noticeable difference. From a code-style point of view, I'd suggest making them the same as the others, but it is fine to keep them.
I agree. Thanks!
Const in the argument list doesn't matter for optimizations. If you don't store into the variable, the compiler can easily see that. It is mostly a style thing (to prevent accidental reassignment).
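As a side note, here is a minimal sketch of that point (the helper name and signature below are made up for illustration, not taken from this PR's diff): top-level const on a by-value parameter doesn't change what callers see or what the compiler generates; it only prevents reassignment inside the function body.

#include <cstdint>

// Hypothetical helper for illustration only; the name is not from the PR.
// Top-level const on a by-value parameter is a style choice: it stops the
// body from reassigning the parameter but does not affect callers or codegen.
int64_t ChannelsPerGroup(const int64_t C, const int64_t group) {
  // C = 0;  // would not compile: C is const inside this body
  return C / group;
}

// The non-const version behaves identically from the caller's perspective;
// top-level const on parameters is not even part of the function's type.
int64_t ChannelsPerGroupNonConst(int64_t C, int64_t group) {
  return C / group;
}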
facebook-github-bot left a comment:
@BIT-silence has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
💊 CI failures summary (Dr. CI): as of commit d8e061e, there are no failures yet. (This comment was automatically generated by Dr. CI and has been revised 8 times.)
Seems like the two failures are both numerical issues. Should I increase the tolerance? The XLA failure seems real, and we can temporarily disable the XLA test. @ailzhang
I think you may do that as a quick fix. Actually, the GN test does not look very stable. I also have ongoing work to improve LayerNorm's numerical stability in #40307 and #40308, and I will apply the same to GN later. So it should be fine to do that and leave a TODO in the test.
facebook-github-bot left a comment:
@BIT-silence has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
I have confirmed that the improvement in #40308 helps the test stability, so I will merge this PR as a fix and then work on the same numerical-stability improvement as LN soon. Thanks for the fix.
Codecov Report
@@ Coverage Diff @@
## master #44863 +/- ##
=======================================
Coverage 67.90% 67.90%
=======================================
Files 384 384
Lines 49878 49878
=======================================
Hits 33868 33868
Misses 16010 16010
facebook-github-bot left a comment:
@BIT-silence has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@BIT-silence merged this pull request in 1694fde.
Summary:
Fix https://discuss.pytorch.org/t/illegal-memory-access-when-i-use-groupnorm/95800

`dX` is a Tensor, so comparing `dX` with `nullptr` was wrong. cc BIT-silence who wrote the kernel.

The test couldn't pass with `rtol=0` and `x.requires_grad=True`, so I had to update that to `1e-5`.

Pull Request resolved: #44863
Reviewed By: mruberry
Differential Revision: D23754101
Pulled By: BIT-silence
fbshipit-source-id: 2eb0134dd489480e5ae7113a7d7b84629104cd49
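For context, a minimal sketch of the kind of check involved (illustrative only, not the actual PyTorch kernel source; the function name is hypothetical). When the input does not require grad, the gradient Tensor handed to the backward kernel is undefined, so the guard has to use Tensor::defined() rather than a pointer-style comparison with nullptr:

#include <ATen/ATen.h>

// Hypothetical sketch: dX may be an undefined Tensor when the input did not
// require grad. Comparing a Tensor value with nullptr is not a meaningful
// test; Tensor::defined() is the appropriate check before touching storage.
void WriteInputGradIfNeeded(const at::Tensor& dY, at::Tensor& dX) {
  if (dX.defined()) {
    // dX has storage here, so it is safe to write the input gradient into it.
    dX.copy_(dY);  // placeholder for the real GroupNorm backward computation
  }
  // When dX is undefined, skip the work entirely instead of dereferencing an
  // invalid pointer (the illegal memory access from the forum report).
}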