
Conversation

@xwang233 (Collaborator)

Fix https://discuss.pytorch.org/t/illegal-memory-access-when-i-use-groupnorm/95800

`dX` is a Tensor; comparing `dX` with `nullptr` was wrong.

cc @BIT-silence who wrote the kernel.

The test couldn't pass with `rtol=0` and `x.requires_grad=True`, so I had to update the tolerance to `1e-5`.
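
For context, a minimal sketch of the bug pattern (a hypothetical helper, not the actual diff in this PR): for an at::Tensor, the way to test whether an optional output was requested is Tensor::defined(); a comparison against nullptr does not express that.

```cpp
#include <ATen/ATen.h>

// Hypothetical helper, not the real GroupNorm backward kernel: it only
// illustrates how an optional gradient output should be guarded.
void group_norm_backward_sketch(const at::Tensor& dY, at::Tensor& dX) {
  // Wrong idea: dX is an at::Tensor, not a pointer, so testing it against
  // nullptr does not tell us whether a gradient output actually exists.
  //   if (dX != nullptr) { ... }
  //
  // Correct idea: an undefined Tensor is the "no output requested" state.
  if (dX.defined()) {
    dX.copy_(dY);  // stand-in for the real gradient computation
  }
}
```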

@xwang233 requested a review from ezyang on September 17, 2020 at 05:58
@xwang233 (Collaborator, Author)

cc @ptrblck

@xiaomengy (Contributor) left a comment

Thanks for fixing this.

The inline review thread below refers to this hunk of the kernel's argument list:

    int64_t group,
    const int64_t C,
    const int64_t HxW,
    const int64_t group,
Contributor

Is this const required?

@xwang233 (Collaborator, Author)

They are not strictly required. I assume setting them to const may enable better compiler optimizations.

@xiaomengy (Contributor), Sep 17, 2020

I think this probably does not make much of a difference. From a code-style point of view, I'd suggest making them the same as the others, but it is fine to keep them.

@xwang233 (Collaborator, Author)

I agree. Thanks!

Contributor

Const in the argument list doesn't matter for optimizations: if you don't store into the variable, the compiler can easily see that on its own. It is mostly a style thing (to prevent accidental reassignment).
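
A standalone sketch of that point (made-up names, not code from this PR): top-level const on a by-value parameter is not part of the function's type, so it changes nothing for callers or for the optimizer; its only effect is to forbid reassignment inside the body.

```cpp
#include <cstdint>

// Declaration without const and definition with const refer to the same
// function: top-level const on by-value parameters is not part of the type.
int64_t rows(int64_t C, int64_t HxW);

int64_t rows(const int64_t C, const int64_t HxW) {
  // C = 0;  // would not compile, which is the only effect of the const
  return C * HxW;
}
```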

@facebook-github-bot (Contributor) left a comment

@BIT-silence has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@dr-ci bot commented Sep 17, 2020

💊 CI failures summary and remediations

As of commit d8e061e (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



@xwang233 (Collaborator, Author)

Seems like the two failures are both numerical issues. Should I increase the tolerance to 1e-3?

The XLA failure seems real, and we can temporarily disable the XLA test. @ailzhang

@xiaomengy (Contributor)

> Seems like the two failures are both numerical issues. Should I increase the tolerance to 1e-3?
>
> The XLA failure seems real, and we can temporarily disable the XLA test. @ailzhang

I think you may do that as a quick fix. Actually, the test for GN does not look very stable. I also have some ongoing efforts to improve LayerNorm's numerical stability in #40307 and #40308, and I will apply the same to GN later. So it should be fine to do that and leave a TODO in the test.

@facebook-github-bot (Contributor) left a comment

@BIT-silence has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@xiaomengy (Contributor)

>> Seems like the two failures are both numerical issues. Should I increase the tolerance to 1e-3?
>> The XLA failure seems real, and we can temporarily disable the XLA test. @ailzhang
>
> I think you may do that as a quick fix. Actually, the test for GN does not look very stable. I also have some ongoing efforts to improve LayerNorm's numerical stability in #40307 and #40308, and I will apply the same to GN later. So it should be fine to do that and leave a TODO in the test.

I have confirmed that the improvement in #40308 helps with the test stability, so I will merge this PR as a fix and then work on the same numerical-stability improvement as in LN soon. Thanks for the fix.
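
The referenced stability PRs (#40307/#40308) are not reproduced here; as a generic illustration of what improving the numerical stability of normalization statistics usually means, a Welford-style single-pass mean/variance computation avoids the cancellation of the naive sum and sum-of-squares formula:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Generic Welford single-pass mean/variance; illustrative only, not the
// actual kernel change in the referenced PRs.
std::pair<double, double> welford_mean_var(const std::vector<float>& x) {
  double mean = 0.0;
  double m2 = 0.0;  // running sum of squared deviations from the mean
  int64_t n = 0;
  for (float v : x) {
    ++n;
    const double delta = v - mean;
    mean += delta / static_cast<double>(n);
    m2 += delta * (v - mean);
  }
  const double var = (n > 0) ? m2 / static_cast<double>(n) : 0.0;
  return {mean, var};
}
```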

@codecov bot commented Sep 17, 2020

Codecov Report

Merging #44863 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master   #44863   +/-   ##
=======================================
  Coverage   67.90%   67.90%           
=======================================
  Files         384      384           
  Lines       49878    49878           
=======================================
  Hits        33868    33868           
  Misses      16010    16010           


@facebook-github-bot (Contributor) left a comment

@BIT-silence has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@BIT-silence merged this pull request in 1694fde.

xuzhao9 pushed a commit that referenced this pull request Sep 18, 2020
Summary:
Fix https://discuss.pytorch.org/t/illegal-memory-access-when-i-use-groupnorm/95800

`dX` is a Tensor; comparing `dX` with `nullptr` was wrong.

cc BIT-silence who wrote the kernel.

The test couldn't pass with `rtol=0` and `x.requires_grad=True`, so I had to update the tolerance to `1e-5`.

Pull Request resolved: #44863

Reviewed By: mruberry

Differential Revision: D23754101

Pulled By: BIT-silence

fbshipit-source-id: 2eb0134dd489480e5ae7113a7d7b84629104cd49
