Skip to content

Conversation

@pritamdamania87
Copy link
Contributor

Summary:

Note: This PR has been merged into master at b5a2f04 after the 1.7 branch cut
(see original PR: #45642). This PR is to merge it into the 1.7 branch.

---- Original Commit Description Follows ---

Pull Request resolved: #45642

Prior to #45181, initializing a
NCCL process group would work even if no GPUs were present. Although, now since
init_process_group calls barrier() this would fail.

In general the problem was that we could initialize ProcessGroupNCCL without
GPUs and then if we called a method like barrier() the process would crash
since we do % numGPUs resulting in division by zero.
ghstack-source-id: 113490343

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24038839

fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc

seemethere and others added 20 commits September 30, 2020 09:39
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
torch.vmap is a prototype feature and should not be in the stable
binary. This PR:
- Removes the `torch.vmap` API
- Removes the documentation entry for `torch.vmap`
- Changes the vmap tests to use an internal API instead of `torch.vmap`.

Test Plan:
- Tested locally (test_torch, test_type_hints, test_vmap), but also wait
for CI.
ghstack-source-id: a3aaa9b
Pull Request resolved: #45461
Summary:
This PR enables PE + TE for 1.7

Pull Request resolved: #45546

Reviewed By: ZolotukhinM

Differential Revision: D24006940

Pulled By: Krovatkin

fbshipit-source-id: a3326077d34a023941acdb06c4907c96e7ba0115
ghstack-source-id: 4c14034
Pull Request resolved: #45626

Co-authored-by: Michael Suo <suo@suo-fedora-mj0c3k9r.dhcp.thefacebook.com>
Summary:
Pin the libuv versoin to v1.39 for Windows platform.

Pull Request resolved: #45553

Reviewed By: SciPioneer

Differential Revision: D24017246

Pulled By: mrshenli

fbshipit-source-id: ec69f864a7acfbdddd60c3d2b442294ec3e34558

Co-authored-by: gunandrose4u <52735340+gunandrose4u@users.noreply.github.com>
Pull Request resolved: #45554

Reviewed By: izdeby

Differential Revision: D24016825

Pulled By: mrshenli

fbshipit-source-id: 332d860429626a915c06f98cad31e6db1cbc4eb1

Co-authored-by: gunandrose4u <52735340+gunandrose4u@users.noreply.github.com>
Summary:
Pull Request resolved: #45848

This is a resubmit of the following stack:
* start: #45093
* end: #45306

The original stack was reverted due to build failure,
resubmitting.

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D24117781

Pulled By: vkuzo

fbshipit-source-id: fb767fff2b044cfbba695ca3949221904fc8931f
Summary:
Pull Request resolved: #45543

This PR adds documentation for the c10d Store to the public docs. Previously these docs were missing although we exposed a lightly-used (but potentially useful) Python API for our distributed key-value store.
ghstack-source-id: 113409195

Test Plan: Will verify screenshots by building the docs.

Reviewed By: pritamdamania87

Differential Revision: D24005598

fbshipit-source-id: 45c3600e7c3f220710e99a0483a9ce921d75d044
Summary:
Originally introduced in #45023. When I was doing test in the original PR, it was a Conv3d, so this problem was not discovered.

Arrays in `ConvolutionParams` have a fixed length of 3 or 5. This is because `max_dim` is set as a constexpr of 3, regardless of Conv2d or Conv3d. The current code will make some error message be weird. See below in the comments.

https://github.com/pytorch/pytorch/blob/9201c37d020007979e144693d86c8e8599e2fd8f/aten/src/ATen/native/cudnn/Conv.cpp#L212-L226

Pull Request resolved: #45729

Reviewed By: mruberry

Differential Revision: D24081542

Pulled By: ngimel

fbshipit-source-id: 141f8946f4d0db63a723131775731272abeaa6ab
Summary:
Export of embedding bag with dynamic list of offsets.

Pull Request resolved: #44693

Reviewed By: malfet

Differential Revision: D23831980

Pulled By: bzinodev

fbshipit-source-id: 3eaff1a0f20d1bcfb8039e518d78c491be381e1a
) (#45755)

Summary:
* Support propagating `dim_param` in ONNX by encoding as `ShapeSymbol` in `SymbolicShape` of outputs. If export is called with `dynamic_axes` provided, shape inference will start with these axes set as dynamic.
* Add new test file `test_pytorch_onnx_shape_inference.py`, reusing all test cases from `test_pytorch_onnx_onnxruntime.py`, but focus on validating shape for all nodes in graph. Currently this is not enabled in the CI, since there are still quite some existing issues and corner cases to fix. The test is default to run only at opset 12.
* Bug fixes, such as div, _len, and peephole.cpp passes for PackPadded, and LogSoftmaxCrossEntropy.
* This PR depends on existing PR such as 44332.

Pull Request resolved: #44920

Reviewed By: eellison

Differential Revision: D23958398

Pulled By: bzinodev

fbshipit-source-id: 00479d9bd19c867d526769a15ba97ec16d56e51d
Summary:
Because access to https://sourceware.org/git/valgrind.git can be really slow especially in some regions

Pull Request resolved: #45914

Reviewed By: seemethere

Differential Revision: D24144420

Pulled By: malfet

fbshipit-source-id: a454c8c3182c570ec344bf6468bb5e55d8b8da79
Summary:
Fixes #45724

Pull Request resolved: #46001

Reviewed By: mruberry

Differential Revision: D24184058

Pulled By: ngimel

fbshipit-source-id: 7d2bab3206ddbc10a7cae3efd9b5e253f38400a9
)

Summary:

Note: This PR has been merged into master at b5a2f04 after the 1.7 branch cut
(see original PR: #45642). This PR is to merge it into the 1.7 branch.

---- Original Commit Description Follows ---

Pull Request resolved: #45642

Prior to #45181, initializing a
NCCL process group would work even if no GPUs were present. Although, now since
init_process_group calls `barrier()` this would fail.

In general the problem was that we could initialize ProcessGroupNCCL without
GPUs and then if we called a method like `barrier()` the process would crash
since we do % numGPUs resulting in division by zero.
ghstack-source-id: 113490343

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24038839

fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc
@facebook-github-bot facebook-github-bot added oncall: jit Add this issue/PR to JIT oncall triage queue oncall: distributed Add this issue/PR to distributed oncall triage queue fx labels Oct 9, 2020
@dr-ci
Copy link

dr-ci bot commented Oct 9, 2020

💊 CI failures summary and remediations

As of commit 9af59ca (more details on the Dr. CI page):


  • 3/3 failures possibly* introduced in this PR
    • 3/3 non-CircleCI failure(s)

Extra GitHub checks: 1 failed


ci.pytorch.org: 1 failed


codecov.io: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 4 times.

@albanD
Copy link
Collaborator

albanD commented Oct 9, 2020

I think there is a rebase issue no?

@pritamdamania87
Copy link
Contributor Author

I think there is a rebase issue no?

Yes, I messed up something. Will figure it out and resubmit.

@pritamdamania87
Copy link
Contributor Author

This should be the correct PR for cherry-pick: #46073 I updated the release tracker as well.

@facebook-github-bot facebook-github-bot deleted the nccl_gpu_1.7_cherry_pick branch January 27, 2021 18:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

oncall: distributed Add this issue/PR to distributed oncall triage queue oncall: jit Add this issue/PR to JIT oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.