
Conversation

@jeffdaily
Collaborator

This reverts commit 9cfc10d.

This is breaking ROCm CI. Notably, TestNCCL.

@mruberry @ezyang @xw285cornell

…37294)"

This reverts commit 9cfc10d.

This was breaking ROCm CI. Notably, TestNCCL.
@dr-ci

dr-ci bot commented May 19, 2020

💊 CI failures summary and remediations

As of commit 1eaef33 (more details on the Dr. CI page):



❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build caffe2_onnx_main_py3_6_clang7_ubuntu16_04_build (1/1)

Step: "Build" (full log | diagnosis details | 🔁 rerun) ❄️

May 19 00:09:56 fatal: The remote end hung up unexpectedly
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/caffe2/py3.6-clang7-ubuntu16.04:376 
 
real	0m25.506s 
user	0m0.084s 
sys	0m0.040s 
May 19 00:09:17 ++ export BUILD_ENVIRONMENT=caffe2-onnx-main-py3.6-clang7-ubuntu16.04-build 
May 19 00:09:17 ++ BUILD_ENVIRONMENT=caffe2-onnx-main-py3.6-clang7-ubuntu16.04-build 
May 19 00:09:17 ++ git submodule sync 
May 19 00:09:17 ++ git submodule update -q --init --recursive 
May 19 00:09:56 error: RPC failed; curl 56 GnuTLS recv error (-54): Error in the pull function. 
May 19 00:09:56 fatal: The remote end hung up unexpectedly 
May 19 00:09:56 fatal: early EOF 
May 19 00:09:56 fatal: index-pack failed 
May 19 00:09:56 fatal: clone of 'https://github.com/protocolbuffers/protobuf.git' into submodule path 'third_party/protobuf' failed 

ci.pytorch.org: 1 failed



@mruberry
Collaborator

I'm confused about how the PR that this reverts is affecting test_nccl and ROCm. When I look at the CI, it seems like the tensors being compared in test_nccl are very different from each other. Are the tensors actually not that different?

@sunway513

If I had to make a guess, the original failure might be due to the following lines, which changed the test coverage on the ROCm path:
https://github.com/pytorch/pytorch/pull/38689/files#diff-9996665f82f52030836eb8657057cfadL6530
https://github.com/pytorch/pytorch/pull/38689/files#diff-9996665f82f52030836eb8657057cfadL5615
Or this line that changed the tolerance settings on GPUs:
https://github.com/pytorch/pytorch/pull/38689/files#diff-9996665f82f52030836eb8657057cfadR11461
cc @jithunnair-amd

@mruberry
Collaborator

If I had to make a guess, the original failure might be due to the following lines, which changed the test coverage on the ROCm path:
https://github.com/pytorch/pytorch/pull/38689/files#diff-9996665f82f52030836eb8657057cfadL6530
https://github.com/pytorch/pytorch/pull/38689/files#diff-9996665f82f52030836eb8657057cfadL5615
Or this line that changed the tolerance settings on GPUs:
https://github.com/pytorch/pytorch/pull/38689/files#diff-9996665f82f52030836eb8657057cfadR11461
cc @jithunnair-amd

All of these cited changes are in test_torch.py, though, and would have no effect on test_nccl.py.

@mruberry
Collaborator

mruberry commented May 19, 2020

Looks like the tests in question aren't being run because of another failure:

01:30:41 ======================================================================
01:30:41 ERROR: test_backend_full_group (__main__.TestDistBackend)
01:30:41 ----------------------------------------------------------------------
01:30:41 Traceback (most recent call last):
01:30:41   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 204, in wrapper
01:30:41     self._join_processes(fn)
01:30:41   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 306, in _join_processes
01:30:41     self._check_return_codes(elapsed_time)
01:30:41   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 339, in _check_return_codes
01:30:41     raise RuntimeError(error)
01:30:41 RuntimeError: Processes 1 2 exited with error code 10
01:30:41 
01:30:41 ----------------------------------------------------------------------

@jithunnair-amd
Collaborator

@mruberry Which log is the above snippet from?

@mruberry
Collaborator

@mruberry Which log is the above snippet from?

https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-test2/28334/console

It's from subtest 2 under the rocmdeb-ubuntu16.04 tests, the one that would run the NCCL tests.

@jithunnair-amd
Collaborator

Thanks. The actual reason for the failure is a memory access fault:

01:29:11 test_backend_full_group (__main__.TestDistBackend) ... Memory exception on virtual address 0x7f61a970c000, node id 4 : Page not present
01:29:11 GPU address 0x7f61a970c000, node id 0, size in byte 0x1000
01:29:11 Memory is allocated using hsaKmtAllocMemory
01:29:11 CPU address of the memory is 0x7f61a970c000
01:29:11 Memory is mapped to node id: 4 5 6 7 
01:29:11 Memory access fault by GPU node-4 (Agent handle: 0x3c0fa60) on address 0x7f61a970c000. Reason: Page not present or supervisor privilege.

@jeffdaily This is most likely the same as the known issue being debugged regarding multi-GPU memory access faults on ROCm.
@mruberry I'm attempting to reproduce the test_nccl failure at 9cfc10d. I'll get back with some details.

@mruberry
Collaborator

@mruberry I'm attempting to reproduce the test_nccl failure at 9cfc10d. I'll get back with some details.

Cool, when you run the test would you also post the tensor values that are being compared? I'm curious if they actually are different or not.

@jithunnair-amd
Collaborator

jithunnair-amd commented May 19, 2020

I've been able to reproduce the assertion error at commit 9cfc10d. However, it is only reproducible when running the test_nccl suite as a whole, not when running the same tests individually. This points to a potential issue with RCCL buffers or some other synchronization issue.

[EDIT] I rebuilt PyTorch at the commit just before 9cfc10d and couldn't reproduce the error. That would seem to point to something in 9cfc10d causing this, but I haven't gotten far enough to identify exactly what it is.
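
For reference, here is a minimal sketch of the two invocation patterns that behave differently here, assuming test_nccl.py is a standard unittest module importable from PyTorch's test/ directory; the individual test id below is illustrative, not taken from the suite:

```python
# Minimal sketch: whole-suite vs. single-test invocation of test_nccl.
# Assumes it is run from PyTorch's test/ directory so that test_nccl is
# importable; the individual test id is illustrative.
import unittest

loader = unittest.defaultTestLoader

# Whole-suite run (the failing case on ROCm): every TestNCCL test executes
# in the same process, sharing whatever RCCL/communicator state earlier
# tests leave behind.
suite = loader.loadTestsFromName("test_nccl")
unittest.TextTestRunner(verbosity=2).run(suite)

# Single-test run (passes): only one test is loaded, so no earlier
# collectives have run in the process before it.
single = loader.loadTestsFromName("test_nccl.TestNCCL.test_broadcast")
unittest.TextTestRunner(verbosity=2).run(single)
```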

@jeffdaily
Collaborator Author

It's important that we fully understand why this commit caused test_nccl.py to fail. Testing locally, I could reproduce the failure. Blacklisting this test allowed all remaining tests to pass.
Closing this PR in favor of #38730.

@jeffdaily jeffdaily closed this May 19, 2020
facebook-github-bot pushed a commit that referenced this pull request May 19, 2020
Summary:
CC ezyang xw285cornell sunway513

Work-around for recent ROCm CI failures due to 9cfc10d (#37294).  Replaces full revert suggested by PR #38689.
Pull Request resolved: #38730

Differential Revision: D21648707

Pulled By: xw285cornell

fbshipit-source-id: 627b11b229c7eadca1f6e0c6192c6b5b6416e6a1
@jithunnair-amd
Collaborator

@mruberry @ezyang I have narrowed down the reason why commit 9cfc10d causes the test_nccl failure: the assertion logic has now changed to comparing tensors on the CPU as opposed to the GPU when a is a CUDA tensor and b is a CPU tensor:
Post - 9cfc10d: [embedded code snippet not rendered]
Pre - 9cfc10d: [embedded code snippet not rendered]
Still debugging the root cause.
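
In rough terms, the difference being described is something like the following simplified sketch (not the actual assertEqual implementation in torch/testing/_internal/common_utils.py), assuming a is a CUDA tensor and b is a CPU tensor:

```python
# Simplified sketch of the device-handling difference described above;
# not the actual assertEqual implementation.
import torch

def compare_pre_9cfc10d(a, b):
    # Pre-9cfc10d (roughly): the CPU tensor is moved onto the GPU and the
    # comparison runs on the device.
    return torch.allclose(a, b.to(a.device))

def compare_post_9cfc10d(a, b):
    # Post-9cfc10d (roughly): the CUDA tensor is copied back to the host and
    # the comparison runs on the CPU, which implies a different
    # synchronization pattern than the device-side comparison above.
    return torch.allclose(a.cpu(), b)
```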

@mruberry
Collaborator

@jithunnair-amd that's correct. Tests (like this one?) were and still are relying on assertEqual to synchronize devices for them. This is incorrect, however. In the future we'll probably default the exact_device kwarg to True to ensure that tests perform their own cross-device synchronization or explicitly indicate they don't care about how the synchronization is performed.
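
For illustration, a minimal sketch (the helper name and structure are illustrative, not taken from the test suite) of a test making the cross-device handling explicit instead of leaning on assertEqual's implicit device moves:

```python
# Minimal sketch of a test handling cross-device comparison explicitly,
# rather than relying on assertEqual to move tensors between devices.
import torch

def check_collective_result(test_case, gpu_result, cpu_expected):
    # Option 1: bring the GPU result back to the host ourselves; the blocking
    # copy makes the device-to-host data dependence explicit in the test.
    test_case.assertEqual(gpu_result.cpu(), cpu_expected)

    # Option 2: compare on the GPU by moving the expected value there, so the
    # comparison itself exercises the device.
    test_case.assertEqual(gpu_result, cpu_expected.to(gpu_result.device))
```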

facebook-github-bot pushed a commit that referenced this pull request Jun 5, 2020
…_nccl tests (#39354)

Summary:
All individual test_nccl unit tests have been disabled for ROCm in bf93954
test_nccl was also added to the ROCM_BLACKLIST in 87b198d
However, the issue only arises when running the test_nccl suite as a whole (as opposed to any one test individually). More details in comments here: #38689

This PR enables the test_nccl suite with only two tests so as to work around the as-yet unresolved issue above, while allowing at least one test_nccl collective test to run on ROCm. This is also needed as a precursor for #38515
Pull Request resolved: #39354

Differential Revision: D21843194

Pulled By: mrshenli

fbshipit-source-id: b28d1e073d8d0fdc1b59928fc3b00187cfd02a35
