Fix bug on the backpropagation of LayerNorm when create_graph=True #41595
Conversation
facebook-github-bot left a comment
@BIT-silence has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Thanks for fixing this.
💊 CI failures summary and remediations (as of commit 556b74a; more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns. The following CI failures do not appear to be due to upstream breakages:
albanD left a comment
Thanks for the fix.
I am surprised that our tests didn't catch this. The layer norm double backward should be checked with finite differences. It is properly listed in https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_nn.py
Or is it that we don't test for cases where gamma is different from 1?
Also, the CI failures look related.
I think somehow our double grad check didn't check the case when gamma and beta are defined. My comments in the code can help fix the test. We should check if gamma is defined before computing the grads.
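For illustration, a minimal sketch of the kind of finite-difference double-backward check discussed here, with gamma (weight) and beta (bias) explicitly defined; the shapes are illustrative and not taken from common_nn.py:

```python
import torch
import torch.nn.functional as F
from torch.autograd import gradgradcheck

# Double precision keeps the finite-difference comparison stable.
x = torch.randn(4, 8, dtype=torch.double, requires_grad=True)
weight = torch.randn(8, dtype=torch.double, requires_grad=True)  # gamma != 1
bias = torch.randn(8, dtype=torch.double, requires_grad=True)    # beta != 0

def fn(x, weight, bias):
    return F.layer_norm(x, (8,), weight, bias)

# Fails if the analytical double backward disagrees with finite differences.
assert gradgradcheck(fn, (x, weight, bias))
```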
facebook-github-bot left a comment
@BIT-silence has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Please add a test in common_nn.py or in test_nn.py that would check this behavior (that is, would fail before your fix and pass with your fix).
It is hard to write such a test using the current test framework (checking the double backward with finite differences). The change actually passed the double backward checks with simple test inputs :< (maybe the current implementation has analytically no effect on the double backward?). So I had no option but to add an ad-hoc test.
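As a rough, hedged sketch of one possible ad-hoc check (not necessarily the exact test added in this PR): compare the first-order gradients LayerNorm produces through the create_graph=True path with those from the regular backward, using non-trivial affine parameters:

```python
import torch
import torch.nn as nn

def layer_norm_grad(create_graph):
    torch.manual_seed(0)                # same weights and input in both runs
    m = nn.LayerNorm(16).double()
    with torch.no_grad():               # make gamma/beta non-trivial
        m.weight.normal_()
        m.bias.normal_()
    x = torch.randn(4, 16, dtype=torch.double, requires_grad=True)
    g, = torch.autograd.grad(m(x).sum(), x, create_graph=create_graph)
    return g.detach()

# Both code paths should produce identical first-order gradients.
assert torch.allclose(layer_norm_grad(False), layer_norm_grad(True))
```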
I understand it will be a little hard to add such a test in a short time, and when adding it there is also a risk that some other ops may fail it. So I think we can open another issue to explain the missing test and track it there. What do you think? @ngimel
albanD left a comment
Makes sense that you need a custom test. That looks good. We might want to open a task to make this a generic test indeed.
test/test_nn.py (Outdated)
If you just want to add a simple test outside of the generic test framework, you can just add a new method to the TestNN class above and it will just run.
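For example, a minimal sketch of that suggestion, with a plain unittest.TestCase standing in for the real TestNN in test/test_nn.py and an illustrative method name:

```python
import unittest
import torch
import torch.nn as nn

class TestNN(unittest.TestCase):  # stand-in for the existing TestNN class
    def test_layer_norm_double_backward(self):
        m = nn.LayerNorm(8, elementwise_affine=True).double()
        x = torch.randn(2, 8, dtype=torch.double, requires_grad=True)
        g, = torch.autograd.grad(m(x).sum(), x, create_graph=True)
        # Differentiating the gradient again exercises the double backward.
        gg, = torch.autograd.grad(g.sum(), x)
        self.assertEqual(gg.shape, x.shape)

if __name__ == "__main__":
    unittest.main()
```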
I made a test class that inherits from TestCase.
We usually avoid new classes because, as you can see just below, we have custom setup code that is simpler to handle with a shared class.
Would it be possible to just move this test in the TestNN class that is just below?
Signed-off-by: Vinnam Kim <vinnamkim@gmail.com>
Co-authored-by: Xiaomeng Yang <bit.yangxm@gmail.com>
Signed-off-by: Vinnam Kim <vinnam.kim@gmail.com>
Signed-off-by: Vinnam Kim <vinnam.kim@gmail.com>
albanD left a comment
That looks great, thanks!
Can you just rebase on top of master to make sure all CI is fine?
test/test_nn.py (Outdated)
nit: the linter might not be happy with the removal of this line.
Force-pushed from 09cccc6 to 0f68a81
Signed-off-by: Vinnam Kim <vinnam.kim@gmail.com>
Force-pushed from 0f68a81 to 556b74a
facebook-github-bot left a comment
@BIT-silence has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
xiaomengy left a comment
The BC test failure should not be related.
@BIT-silence merged this pull request in 825a387.
Solves issue #41332
I found that the bug in #41332 is caused by LayerNorm.
Current implementations of LayerNorm have a disparity between:
- the create_graph=False CUDA implementation
- the create_graph=True implementation

With this bug-fix, #41332 is solved.
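As a hedged illustration (not the exact reproduction from #41332), the create_graph=True path is exercised whenever gradients flowing through LayerNorm are themselves differentiated, for example in a gradient-penalty term; the small network below is made up for this sketch:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8))
x = torch.randn(4, 8, requires_grad=True)

# create_graph=True records the backward pass itself, which selects the
# differentiable LayerNorm backward path that this PR fixes.
grad_x, = torch.autograd.grad(net(x).sum(), x, create_graph=True)
penalty = grad_x.pow(2).sum()
penalty.backward()  # second-order backward through LayerNorm
```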
@ailing @BIT-silence
Signed-off-by: Vinnam Kim vinnamkim@gmail.com