[v1.7 patch] Disallow creation of ProcessGroupNCCL without GPUs. (#45642) #46070
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
torch.vmap is a prototype feature and should not be in the stable binary. This PR: - Removes the `torch.vmap` API - Removes the documentation entry for `torch.vmap` - Changes the vmap tests to use an internal API instead of `torch.vmap`. Test Plan: - Tested locally (test_torch, test_type_hints, test_vmap), but also wait for CI.
…#45859) Co-authored-by: Eli Uriegas <eliuriegas@fb.com>
Summary: Pin the libuv version to v1.39 on the Windows platform. Pull Request resolved: #45553 Reviewed By: SciPioneer Differential Revision: D24017246 Pulled By: mrshenli fbshipit-source-id: ec69f864a7acfbdddd60c3d2b442294ec3e34558 Co-authored-by: gunandrose4u <52735340+gunandrose4u@users.noreply.github.com>
Pull Request resolved: #45554 Reviewed By: izdeby Differential Revision: D24016825 Pulled By: mrshenli fbshipit-source-id: 332d860429626a915c06f98cad31e6db1cbc4eb1 Co-authored-by: gunandrose4u <52735340+gunandrose4u@users.noreply.github.com>
Summary: Pull Request resolved: #45848 This is a resubmit of the following stack: * start: #45093 * end: #45306 The original stack was reverted due to build failure, resubmitting. Test Plan: Imported from OSS Reviewed By: jerryzh168 Differential Revision: D24117781 Pulled By: vkuzo fbshipit-source-id: fb767fff2b044cfbba695ca3949221904fc8931f
Summary: Pull Request resolved: #45543 This PR adds documentation for the c10d Store to the public docs. Previously these docs were missing although we exposed a lightly-used (but potentially useful) Python API for our distributed key-value store. ghstack-source-id: 113409195 Test Plan: Will verify screenshots by building the docs. Reviewed By: pritamdamania87 Differential Revision: D24005598 fbshipit-source-id: 45c3600e7c3f220710e99a0483a9ce921d75d044
Summary: Originally introduced in #45023. When I tested the original PR, I used a Conv3d, so this problem was not discovered. Arrays in `ConvolutionParams` have a fixed length of 3 or 5. This is because `max_dim` is set as a constexpr of 3, regardless of Conv2d or Conv3d. The current code makes some error messages look odd. See below in the comments. https://github.com/pytorch/pytorch/blob/9201c37d020007979e144693d86c8e8599e2fd8f/aten/src/ATen/native/cudnn/Conv.cpp#L212-L226 Pull Request resolved: #45729 Reviewed By: mruberry Differential Revision: D24081542 Pulled By: ngimel fbshipit-source-id: 141f8946f4d0db63a723131775731272abeaa6ab
) (#45755) Summary: * Support propagating `dim_param` in ONNX by encoding it as `ShapeSymbol` in the `SymbolicShape` of outputs. If export is called with `dynamic_axes` provided, shape inference will start with these axes set as dynamic. * Add new test file `test_pytorch_onnx_shape_inference.py`, reusing all test cases from `test_pytorch_onnx_onnxruntime.py`, but focusing on validating the shape of every node in the graph. Currently this is not enabled in the CI, since there are still a number of existing issues and corner cases to fix. By default the test runs only at opset 12. * Bug fixes, such as div, _len, and peephole.cpp passes for PackPadded, and LogSoftmaxCrossEntropy. * This PR depends on existing PRs such as 44332. Pull Request resolved: #44920 Reviewed By: eellison Differential Revision: D23958398 Pulled By: bzinodev fbshipit-source-id: 00479d9bd19c867d526769a15ba97ec16d56e51d
Summary: Because access to https://sourceware.org/git/valgrind.git can be really slow especially in some regions Pull Request resolved: #45914 Reviewed By: seemethere Differential Revision: D24144420 Pulled By: malfet fbshipit-source-id: a454c8c3182c570ec344bf6468bb5e55d8b8da79
) Summary: Note: This PR has been merged into master at b5a2f04 after the 1.7 branch cut (see original PR: #45642). This PR is to merge it into the 1.7 branch. ---- Original Commit Description Follows --- Pull Request resolved: #45642 Prior to #45181, initializing a NCCL process group would work even if no GPUs were present. However, since init_process_group now calls `barrier()`, this fails. In general, the problem was that we could initialize ProcessGroupNCCL without GPUs, and if we then called a method like `barrier()`, the process would crash because we compute % numGPUs, which results in division by zero. ghstack-source-id: 113490343 Test Plan: waitforbuildbot Reviewed By: osalpekar Differential Revision: D24038839 fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc
I think there is a rebase issue, no?
Yes, I messed something up. Will figure it out and resubmit.
This should be the correct PR for cherry-pick: #46073. I updated the release tracker as well.
Summary:
Note: This PR has been merged into master at b5a2f04 after the 1.7 branch cut
(see original PR: #45642). This PR is to merge it into the 1.7 branch.
---- Original Commit Description Follows ---
Pull Request resolved: #45642
Prior to #45181, initializing a NCCL process group would work even if no GPUs
were present. However, since init_process_group now calls `barrier()`, this
fails.
In general, the problem was that we could initialize ProcessGroupNCCL without
GPUs, and if we then called a method like `barrier()`, the process would crash
because we compute % numGPUs, which results in division by zero.
ghstack-source-id: 113490343
Test Plan: waitforbuildbot
Reviewed By: osalpekar
Differential Revision: D24038839
fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc
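For illustration, the failure mode and the guard described above can be sketched in plain Python. This is not the actual ProcessGroupNCCL code, which is C++. The class name `SketchProcessGroup`, the `num_gpus` argument, the `barrier_device` method, and the error text are all hypothetical, chosen only to show why a modulo by the GPU count needs a check at construction time.

```python
class SketchProcessGroup:
    """Hypothetical stand-in for a NCCL-style process group (not the real class)."""

    def __init__(self, rank, size, num_gpus):
        # Guard at construction: refuse to build a group when no GPUs exist,
        # so later methods can safely compute `rank % num_gpus`.
        if num_gpus <= 0:
            # The message wording here is illustrative only.
            raise RuntimeError("process group requires GPUs, but no GPUs were found")
        self.rank = rank
        self.size = size
        self.num_gpus = num_gpus

    def barrier_device(self):
        # Pick a device index for the barrier. Without the constructor guard,
        # num_gpus == 0 would make this a modulo by zero.
        return self.rank % self.num_gpus


def unguarded_barrier_device(rank, num_gpus):
    # The pre-fix behaviour in miniature: nothing stops num_gpus from being 0.
    return rank % num_gpus
```

In Python the unguarded modulo raises `ZeroDivisionError`, whereas in C++ an integer modulo by zero is undefined behavior and typically crashes the process, which matches the crash described in the summary. Failing early in the constructor turns that crash into a clear error at initialization.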