[v1.7 patch] Disallow creation of ProcessGroupNCCL without GPUs. (#45642) #46070

pritamdamania87 · 2020-10-09T00:12:52Z

Summary:

Note: This PR has been merged into master at b5a2f04 after the 1.7 branch cut
(see original PR: #45642). This PR is to merge it into the 1.7 branch.

---- Original Commit Description Follows ---

Pull Request resolved: #45642

Prior to #45181, initializing an
NCCL process group would work even if no GPUs were present. Now that
init_process_group calls barrier(), this fails.

In general, the problem was that we could initialize ProcessGroupNCCL without
GPUs, and a later call to a method like barrier() would crash the process,
since we compute % numGPUs, which is a division by zero when numGPUs is 0.
ghstack-source-id: 113490343
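
The failure mode described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the actual ProcessGroupNCCL code: `NoGpuError`, `device_for_rank`, and `create_group` are hypothetical names, and the error message is made up. It shows why a modulo by the GPU count breaks when the count is zero, and how a check at construction time turns a late crash into an early, descriptive error.

```python
class NoGpuError(RuntimeError):
    """Hypothetical error for requesting an NCCL-style group with no GPUs."""


def device_for_rank(rank: int, num_gpus: int) -> int:
    # Stand-in for the "% numGPUs" step from the description. With
    # num_gpus == 0 this raises ZeroDivisionError in Python; the
    # description says the real process crashes at this point.
    return rank % num_gpus


def create_group(rank: int, num_gpus: int) -> dict:
    # Hypothetical constructor: reject the configuration up front,
    # before any collective such as barrier() can be called.
    if num_gpus <= 0:
        raise NoGpuError("cannot create an NCCL process group: no GPUs found")
    return {"rank": rank, "device": device_for_rank(rank, num_gpus)}
```

Failing in the constructor means the caller sees the problem where the process group is created, instead of at the first collective call.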

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24038839

fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

torch.vmap is a prototype feature and should not be in the stable binary. This PR: - Removes the `torch.vmap` API - Removes the documentation entry for `torch.vmap` - Changes the vmap tests to use an internal API instead of `torch.vmap`. Test Plan: - Tested locally (test_torch, test_type_hints, test_vmap), but also wait for CI.

ghstack-source-id: a3aaa9b Pull Request resolved: #45461

Summary: This PR enables PE + TE for 1.7 Pull Request resolved: #45546 Reviewed By: ZolotukhinM Differential Revision: D24006940 Pulled By: Krovatkin fbshipit-source-id: a3326077d34a023941acdb06c4907c96e7ba0115

ghstack-source-id: 4c14034 Pull Request resolved: #45626 Co-authored-by: Michael Suo <suo@suo-fedora-mj0c3k9r.dhcp.thefacebook.com>

…#45859) Co-authored-by: Eli Uriegas <eliuriegas@fb.com>

Summary: Pin the libuv version to v1.39 for the Windows platform. Pull Request resolved: #45553 Reviewed By: SciPioneer Differential Revision: D24017246 Pulled By: mrshenli fbshipit-source-id: ec69f864a7acfbdddd60c3d2b442294ec3e34558 Co-authored-by: gunandrose4u <52735340+gunandrose4u@users.noreply.github.com>

Pull Request resolved: #45554 Reviewed By: izdeby Differential Revision: D24016825 Pulled By: mrshenli fbshipit-source-id: 332d860429626a915c06f98cad31e6db1cbc4eb1 Co-authored-by: gunandrose4u <52735340+gunandrose4u@users.noreply.github.com>

Summary: Pull Request resolved: #45848 This is a resubmit of the following stack: * start: #45093 * end: #45306 The original stack was reverted due to build failure, resubmitting. Test Plan: Imported from OSS Reviewed By: jerryzh168 Differential Revision: D24117781 Pulled By: vkuzo fbshipit-source-id: fb767fff2b044cfbba695ca3949221904fc8931f

Summary: Pull Request resolved: #45543 This PR adds documentation for the c10d Store to the public docs. Previously these docs were missing although we exposed a lightly-used (but potentially useful) Python API for our distributed key-value store. ghstack-source-id: 113409195 Test Plan: Will verify screenshots by building the docs. Reviewed By: pritamdamania87 Differential Revision: D24005598 fbshipit-source-id: 45c3600e7c3f220710e99a0483a9ce921d75d044

Summary: Originally introduced in #45023. When I tested the original PR, I used a Conv3d, so this problem went undiscovered. Arrays in `ConvolutionParams` have a fixed length of 3 or 5. This is because `max_dim` is set as a constexpr of 3, regardless of Conv2d or Conv3d. As a result, the current code produces some odd-looking error messages. See below in the comments. https://github.com/pytorch/pytorch/blob/9201c37d020007979e144693d86c8e8599e2fd8f/aten/src/ATen/native/cudnn/Conv.cpp#L212-L226 Pull Request resolved: #45729 Reviewed By: mruberry Differential Revision: D24081542 Pulled By: ngimel fbshipit-source-id: 141f8946f4d0db63a723131775731272abeaa6ab
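
A rough Python sketch of the effect described in this summary, under stated assumptions: `MAX_DIM`, `pad_to`, and `format_params` are hypothetical names, and zero-filling of unused slots is assumed for illustration. It does not reproduce the code in `Conv.cpp`. It only shows how printing a fixed-length array in full adds a spurious trailing entry for a 2-D convolution, and how trimming to the real number of spatial dimensions avoids that.

```python
MAX_DIM = 3  # fixed upper bound on spatial dimensions, as in the summary


def pad_to(values, length):
    # Store a variable-length parameter list in a fixed-length slot,
    # filling unused trailing entries with 0 (an assumption of this sketch).
    return list(values) + [0] * (length - len(values))


def format_params(padding, spatial_dims):
    stored = pad_to(padding, MAX_DIM)
    # Naive message: prints every slot, including the unused one.
    naive = f"padding = {stored}"
    # Trimmed message: prints only the entries that belong to this convolution.
    trimmed = f"padding = {stored[:spatial_dims]}"
    return naive, trimmed
```

For a 2-D convolution with `padding=[1, 1]`, the naive message shows three entries while the trimmed one shows the two that were actually set.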

Summary: Export of embedding bag with dynamic list of offsets. Pull Request resolved: #44693 Reviewed By: malfet Differential Revision: D23831980 Pulled By: bzinodev fbshipit-source-id: 3eaff1a0f20d1bcfb8039e518d78c491be381e1a

) (#45755) Summary:
* Support propagating `dim_param` in ONNX by encoding it as `ShapeSymbol` in the `SymbolicShape` of outputs. If export is called with `dynamic_axes` provided, shape inference will start with these axes set as dynamic.
* Add a new test file, `test_pytorch_onnx_shape_inference.py`, that reuses all test cases from `test_pytorch_onnx_onnxruntime.py` but focuses on validating the shape of every node in the graph. It is not currently enabled in CI, since there are still quite a few existing issues and corner cases to fix. By default, the test runs only at opset 12.
* Bug fixes, such as div, _len, and peephole.cpp passes for PackPadded, and LogSoftmaxCrossEntropy.
* This PR depends on existing PRs such as 44332.
Pull Request resolved: #44920 Reviewed By: eellison Differential Revision: D23958398 Pulled By: bzinodev fbshipit-source-id: 00479d9bd19c867d526769a15ba97ec16d56e51d

Summary: Because access to https://sourceware.org/git/valgrind.git can be really slow especially in some regions Pull Request resolved: #45914 Reviewed By: seemethere Differential Revision: D24144420 Pulled By: malfet fbshipit-source-id: a454c8c3182c570ec344bf6468bb5e55d8b8da79

Summary: Fixes #45724 Pull Request resolved: #46001 Reviewed By: mruberry Differential Revision: D24184058 Pulled By: ngimel fbshipit-source-id: 7d2bab3206ddbc10a7cae3efd9b5e253f38400a9


dr-ci · 2020-10-09T00:41:29Z

💊 CI failures summary and remediations

As of commit 9af59ca (more details on the Dr. CI page):

3/3 failures possibly* introduced in this PR
- 3/3 non-CircleCI failure(s)


albanD · 2020-10-09T00:43:27Z

I think there is a rebase issue no?

pritamdamania87 · 2020-10-09T01:16:34Z

> I think there is a rebase issue no?

Yes, I messed up something. Will figure it out and resubmit.

pritamdamania87 · 2020-10-09T01:24:14Z

This should be the correct PR for the cherry-pick: #46073. I updated the release tracker as well.

seemethere and others added 20 commits September 30, 2020 09:39

Update target determinator to point to release/1.7

cf07ba5

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Add allowlist for complex backward (#45602)

e8cea53

ghstack-source-id: a3aaa9b Pull Request resolved: #45461

Enable PE + TE (#45546) (#45591)

07e66d7

Summary: This PR enables PE + TE for 1.7 Pull Request resolved: #45546 Reviewed By: ZolotukhinM Differential Revision: D24006940 Pulled By: Krovatkin fbshipit-source-id: a3326077d34a023941acdb06c4907c96e7ba0115

Make torch.package private and add a big warning (#45628)

fc8f987

ghstack-source-id: 4c14034 Pull Request resolved: #45626 Co-authored-by: Michael Suo <suo@suo-fedora-mj0c3k9r.dhcp.thefacebook.com>

[1.7] Hide FX (#45631)

543d097

patch #45586 (#45601)

1ffcdd0

[1.7] .jenkins: switch to compare against stable and update allowlist (…

d728e23

…#45859) Co-authored-by: Eli Uriegas <eliuriegas@fb.com>

[iOS] 1.7 hotfix (#45891)

7d0c7b3

[ONNX] Update embedding_bag export (#44693) (#45756)

8f8da60

Summary: Export of embedding bag with dynamic list of offsets. Pull Request resolved: #44693 Reviewed By: malfet Differential Revision: D23831980 Pulled By: bzinodev fbshipit-source-id: 3eaff1a0f20d1bcfb8039e518d78c491be381e1a

Disable angle backwards and handle r to c backward for add (#45839)

65a1827

Workaround for cublas bug for 45724 (#46001) (#46042)

653d766

Summary: Fixes #45724 Pull Request resolved: #46001 Reviewed By: mruberry Differential Revision: D24184058 Pulled By: ngimel fbshipit-source-id: 7d2bab3206ddbc10a7cae3efd9b5e253f38400a9

pritamdamania87 requested review from albanD, apaszke, mrshenli, pietern, rohan-varma and zhaojuanmao as code owners October 9, 2020 00:12

facebook-github-bot added the oncall: jit, oncall: distributed, and fx labels Oct 9, 2020

pritamdamania87 mentioned this pull request Oct 9, 2020

[v1.7.0] Release Tracker #45592

Closed

pritamdamania87 closed this Oct 9, 2020

facebook-github-bot deleted the nccl_gpu_1.7_cherry_pick branch January 27, 2021 18:27


💊 CI failures summary and remediations

Extra GitHub checks: 1 failed

ci.pytorch.org: 1 failed

codecov.io: 1 failed
