
Conversation

@vinnamkim
Contributor

Solves issue #41332

I found that the bug in #41332 is caused by LayerNorm.

The current LayerNorm implementations have a disparity between:

  1. the create_graph=False CUDA implementation
  2. the create_graph=True implementation

With this bug-fix, #41332 is solved.
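For illustration, here is a minimal repro sketch of the kind of disparity described above (hypothetical shapes and inputs, not the exact script from #41332): the first-order LayerNorm gradient should not depend on whether create_graph is set.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
x0 = torch.randn(4, 8, dtype=torch.double)
w0 = torch.randn(8, dtype=torch.double)
b0 = torch.randn(8, dtype=torch.double)

def grad_x(create_graph):
    # Same inputs each time; only the create_graph flag changes which
    # backward path is exercised.
    x = x0.to(device).detach().requires_grad_(True)
    w = w0.to(device).detach().requires_grad_(True)
    b = b0.to(device).detach().requires_grad_(True)
    out = F.layer_norm(x, (8,), w, b)
    (gx,) = torch.autograd.grad(out.sum(), x, create_graph=create_graph)
    return gx.detach().cpu()

# Before the fix these could disagree on CUDA; after the fix they should match.
print(torch.allclose(grad_x(create_graph=False), grad_x(create_graph=True)))
```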

@ailing @BIT-silence

Signed-off-by: Vinnam Kim <vinnamkim@gmail.com>

@zou3519 zou3519 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jul 17, 2020
Contributor

@facebook-github-bot facebook-github-bot left a comment


@BIT-silence has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@xiaomengy
Contributor

Thanks for fixing this.

@xiaomengy xiaomengy requested a review from ngimel July 17, 2020 15:42
@dr-ci

dr-ci bot commented Jul 17, 2020

💊 CI failures summary and remediations

As of commit 556b74a (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_backward_compatibility_check_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jul 21 16:49:18 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.
Jul 21 16:49:18 processing existing schema:  __str__(__torch__.torch.classes._TorchScriptTesting._StackString _0) -> (str _0) 
Jul 21 16:49:18 processing existing schema:  __init__(__torch__.torch.classes._TorchScriptTesting._PickleTester _0, int[] _1) -> (None _0) 
Jul 21 16:49:18 processing existing schema:  __getstate__(__torch__.torch.classes._TorchScriptTesting._PickleTester _0) -> (int[] _0) 
Jul 21 16:49:18 processing existing schema:  __setstate__(__torch__.torch.classes._TorchScriptTesting._PickleTester _0, int[] _1) -> (None _0) 
Jul 21 16:49:18 processing existing schema:  top(__torch__.torch.classes._TorchScriptTesting._PickleTester _0) -> (int _0) 
Jul 21 16:49:18 processing existing schema:  pop(__torch__.torch.classes._TorchScriptTesting._PickleTester _0) -> (int _0) 
Jul 21 16:49:18 processing existing schema:  get(__torch__.torch.classes._TorchScriptTesting._LiteInterpreterTest _0, Tensor _1) -> (str _0) 
Jul 21 16:49:18 processing existing schema:  __getstate__(__torch__.torch.classes._TorchScriptTesting._LiteInterpreterTest _0) -> (int _0) 
Jul 21 16:49:18 processing existing schema:  __setstate__(__torch__.torch.classes._TorchScriptTesting._LiteInterpreterTest _0, int _1) -> (None _0) 
Jul 21 16:49:18 processing existing schema:  __init__(__torch__.torch.classes.dist_rpc.WorkerInfo _0, str _1, int _2) -> (None _0) 
Jul 21 16:49:18 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.  
Jul 21 16:49:18  
Jul 21 16:49:18 Broken ops: [ 
Jul 21 16:49:18 	aten::split_with_sizes(Tensor self, int[] split_sizes, int dim=0) -> (Tensor[]) 
Jul 21 16:49:18 ] 
Jul 21 16:49:18 + cleanup 
Jul 21 16:49:18 + retcode=1 
Jul 21 16:49:18 + set +x 
Jul 21 16:49:18 =================== sccache compilation log =================== 
Jul 21 16:49:18 =========== If your build fails, plea


Collaborator

@albanD albanD left a comment


Thanks for the fix.

I am surprised that our tests didn't catch this. The layer norm double backward should be checked with finite differences. It is properly listed in https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_nn.py
Or is it that we don't test cases where gamma is different from 1?

Also, the CI failures look related.

@xiaomengy
Contributor

Thanks for the fix.

I am surprised that our tests didn't catch this. The layer norm double backward should be checked with finite differences. It is properly listed in https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_nn.py
Or is it that we don't test cases where gamma is different from 1?

Also, the CI failures look related.

I think somehow our double grad check didn't cover the case where gamma and beta are defined. My comments in the code can help fix the test. We should check whether gamma is defined before computing the grads.
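For reference, a sketch of the missing coverage being described (illustrative only, not the PR's actual test): run gradgradcheck on layer_norm with gamma and beta passed as differentiable inputs, so the affine-dependent terms of the double backward get checked against finite differences.

```python
import torch
import torch.nn.functional as F
from torch.autograd import gradgradcheck

x = torch.randn(3, 5, dtype=torch.double, requires_grad=True)
gamma = torch.randn(5, dtype=torch.double, requires_grad=True)
beta = torch.randn(5, dtype=torch.double, requires_grad=True)

# With gamma != 1 and beta != 0 supplied as inputs, finite differences can
# catch errors in the weight-dependent terms of the double backward.
print(gradgradcheck(lambda x, g, b: F.layer_norm(x, (5,), g, b),
                    (x, gamma, beta)))
```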

Contributor

@facebook-github-bot facebook-github-bot left a comment


@BIT-silence has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel
Collaborator

ngimel commented Jul 17, 2020

Please add a test in common_nn.py or in test_nn.py that would check this behavior (that is, would fail before your fix and pass with your fix).

@vinnamkim
Contributor Author

@albanD @BIT-silence @ngimel

It is hard to write a test for this using the current test frameworks (checking the double backward with finite differences). The buggy code actually passed the double backward checks with simple test inputs :< (maybe the error analytically has no effect on the double backward for those inputs?). So I had no option but to add an ad-hoc test.
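For concreteness, an ad-hoc check along these lines could look roughly like the following (hypothetical class and method names; the actual test added in this PR lives in test/test_nn.py and may differ):

```python
import unittest
import torch
import torch.nn.functional as F

class LayerNormDoubleBackwardTest(unittest.TestCase):
    # Standalone illustration; the review thread below discusses where such a
    # test should live inside test/test_nn.py.
    def test_grads_match_with_and_without_create_graph(self):
        device = "cuda" if torch.cuda.is_available() else "cpu"
        x0 = torch.randn(4, 8, dtype=torch.double, device=device)
        w0 = torch.randn(8, dtype=torch.double, device=device)
        b0 = torch.randn(8, dtype=torch.double, device=device)

        def grad_x(create_graph):
            x = x0.detach().requires_grad_(True)
            w = w0.detach().requires_grad_(True)
            b = b0.detach().requires_grad_(True)
            out = F.layer_norm(x, (8,), w, b)
            (gx,) = torch.autograd.grad(out.sum(), x,
                                        create_graph=create_graph)
            return gx.detach()

        # The disparity described in #41332 shows up when these two disagree.
        self.assertTrue(torch.allclose(grad_x(False), grad_x(True)))

if __name__ == "__main__":
    unittest.main()
```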

@xiaomengy
Contributor

@albanD @BIT-silence @ngimel

It is hard to write a test for this using the current test frameworks (checking the double backward with finite differences). The buggy code actually passed the double backward checks with simple test inputs :< (maybe the error analytically has no effect on the double backward for those inputs?). So I had no option but to add an ad-hoc test.

I can understand that it will be a little hard to add such a test in a short time. And when adding the test, there is also a risk that some other ops may fail it. So I think we can open another issue to explain this missing test and track it there. What do you think? @ngimel

Collaborator

@albanD albanD left a comment


Makes sense that you need a custom test. That looks good. We might want to open a task to make this a generic test indeed.

test/test_nn.py Outdated
Collaborator


If you just want to add a simple test outside of the generic test framework, you can just add a new method to the TestNN class above and it will just run.

Contributor Author

@vinnamkim vinnamkim Jul 21, 2020


Made a test class that inherits from TestCase.

Collaborator


We usually avoid new classes because, as you can see just below, we have custom setup code that is simpler to handle with a shared class.
Would it be possible to just move this test into the TestNN class just below?

Vinnam Kim and others added 4 commits July 22, 2020 00:04
Signed-off-by: Vinnam Kim <vinnamkim@gmail.com>
Co-authored-by: Xiaomeng Yang <bit.yangxm@gmail.com>
Signed-off-by: Vinnam Kim <vinnam.kim@gmail.com>
Signed-off-by: Vinnam Kim <vinnam.kim@gmail.com>
@albanD albanD requested a review from xiaomengy July 21, 2020 15:08
Collaborator

@albanD albanD left a comment


That looks great, thanks!
Can you just rebase on top of master to make sure all CI is fine?

test/test_nn.py Outdated
Collaborator


nit: the linter might not be happy with the removal of this line.

@vinnamkim vinnamkim force-pushed the patch_layernorm branch 2 times, most recently from 09cccc6 to 0f68a81 on July 21, 2020 15:13
Signed-off-by: Vinnam Kim <vinnam.kim@gmail.com>
Contributor

@facebook-github-bot facebook-github-bot left a comment


@BIT-silence has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@xiaomengy xiaomengy left a comment


The BC (backward-compatibility) test failure should not be related.

@facebook-github-bot
Contributor

@BIT-silence merged this pull request in 825a387.
