
Conversation

@xwang233 (Collaborator)

Fix https://discuss.pytorch.org/t/illegal-memory-access-when-i-use-groupnorm/95800

`dX` is a Tensor; comparing `dX` with `nullptr` was wrong.

cc @BIT-silence who wrote the kernel.

The test couldn't pass with `rtol=0` and `x.requires_grad=True`, so I had to update the tolerance to `1e-5`.
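
For context, a minimal sketch of the bug pattern (a hypothetical helper, not the actual diff in this PR): for an at::Tensor, the way to test whether an optional output was requested is Tensor::defined(); a comparison against nullptr does not express that.

```cpp
#include <ATen/ATen.h>

// Hypothetical helper, not the real GroupNorm backward kernel: it only
// illustrates how an optional gradient output should be guarded.
void group_norm_backward_sketch(const at::Tensor& dY, at::Tensor& dX) {
  // Wrong idea: dX is an at::Tensor, not a pointer, so testing it against
  // nullptr does not tell us whether a gradient output actually exists.
  //   if (dX != nullptr) { ... }
  //
  // Correct idea: an undefined Tensor is the "no output requested" state.
  if (dX.defined()) {
    dX.copy_(dY);  // stand-in for the real gradient computation
  }
}
```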

@xwang233 requested a review from ezyang on September 17, 2020 at 05:58
@xwang233 (Collaborator, Author)

cc @ptrblck

@xiaomengy (Contributor) left a comment

Thanks for fixing this.

The inline review thread below refers to this hunk of the kernel's argument list:

    int64_t group,
    const int64_t C,
    const int64_t HxW,
    const int64_t group,
Contributor

Is this const required?

@xwang233 (Collaborator, Author)

They are not strictly required. I assume setting them to const may enable better compiler optimizations.

@xiaomengy (Contributor), Sep 17, 2020

I think this probably does not make much of a difference. From a code-style point of view, I'd suggest making them the same as the others, but it is fine to keep them.

@xwang233 (Collaborator, Author)

I agree. Thanks!

Contributor

Const in the argument list doesn't matter for optimizations: if you don't store into the variable, the compiler can easily see that on its own. It is mostly a style thing (to prevent accidental reassignment).
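
A standalone sketch of that point (made-up names, not code from this PR): top-level const on a by-value parameter is not part of the function's type, so it changes nothing for callers or for the optimizer; its only effect is to forbid reassignment inside the body.

```cpp
#include <cstdint>

// Declaration without const and definition with const refer to the same
// function: top-level const on by-value parameters is not part of the type.
int64_t rows(int64_t C, int64_t HxW);

int64_t rows(const int64_t C, const int64_t HxW) {
  // C = 0;  // would not compile, which is the only effect of the const
  return C * HxW;
}
```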

@facebook-github-bot (Contributor) left a comment

@BIT-silence has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@dr-ci bot commented Sep 17, 2020

💊 CI failures summary and remediations

As of commit d8e061e (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



@xwang233 (Collaborator, Author)

Seems like the two failures are both numerical issues. Should I increase the tolerance to 1e-3?

The XLA failure seems real, and we can temporarily disable the XLA test. @ailzhang

@xiaomengy (Contributor)

> Seems like the two failures are both numerical issues. Should I increase the tolerance to 1e-3?
>
> The XLA failure seems real, and we can temporarily disable the XLA test. @ailzhang

I think you may do that as a quick fix. Actually, the test for GN does not look very stable. I also have some ongoing efforts to improve LayerNorm's numerical stability in #40307 and #40308, and I will apply the same to GN later. So it should be fine to do that and leave a TODO in the test.

@facebook-github-bot (Contributor) left a comment

@BIT-silence has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@xiaomengy (Contributor)

>> Seems like the two failures are both numerical issues. Should I increase the tolerance to 1e-3?
>> The XLA failure seems real, and we can temporarily disable the XLA test. @ailzhang
>
> I think you may do that as a quick fix. Actually, the test for GN does not look very stable. I also have some ongoing efforts to improve LayerNorm's numerical stability in #40307 and #40308, and I will apply the same to GN later. So it should be fine to do that and leave a TODO in the test.

I have confirmed that the improvement in #40308 helps with the test stability, so I will merge this PR as a fix and then work on the same numerical-stability improvement as in LN soon. Thanks for the fix.
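
The referenced stability PRs (#40307/#40308) are not reproduced here; as a generic illustration of what improving the numerical stability of normalization statistics usually means, a Welford-style single-pass mean/variance computation avoids the cancellation of the naive sum and sum-of-squares formula:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Generic Welford single-pass mean/variance; illustrative only, not the
// actual kernel change in the referenced PRs.
std::pair<double, double> welford_mean_var(const std::vector<float>& x) {
  double mean = 0.0;
  double m2 = 0.0;  // running sum of squared deviations from the mean
  int64_t n = 0;
  for (float v : x) {
    ++n;
    const double delta = v - mean;
    mean += delta / static_cast<double>(n);
    m2 += delta * (v - mean);
  }
  const double var = (n > 0) ? m2 / static_cast<double>(n) : 0.0;
  return {mean, var};
}
```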

@codecov bot commented Sep 17, 2020

Codecov Report

Merging #44863 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master   #44863   +/-   ##
=======================================
  Coverage   67.90%   67.90%           
=======================================
  Files         384      384           
  Lines       49878    49878           
=======================================
  Hits        33868    33868           
  Misses      16010    16010           


@facebook-github-bot (Contributor) left a comment

@BIT-silence has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@BIT-silence merged this pull request in 1694fde.

xuzhao9 pushed a commit that referenced this pull request Sep 18, 2020
Summary:
Fix https://discuss.pytorch.org/t/illegal-memory-access-when-i-use-groupnorm/95800

`dX` is a Tensor; comparing `dX` with `nullptr` was wrong.

cc BIT-silence who wrote the kernel.

The test couldn't pass with `rtol=0` and `x.requires_grad=True`, so I had to update the tolerance to `1e-5`.

Pull Request resolved: #44863

Reviewed By: mruberry

Differential Revision: D23754101

Pulled By: BIT-silence

fbshipit-source-id: 2eb0134dd489480e5ae7113a7d7b84629104cd49
