
Conversation

@jeffdaily
Collaborator

This reverts commit 9cfc10d.

This is breaking ROCm CI. Notably, TestNCCL.

@mruberry @ezyang @xw285cornell

…37294)"

This reverts commit 9cfc10d.

This was breaking ROCm CI. Notably, TestNCCL.
@dr-ci

dr-ci bot commented May 19, 2020

💊 CI failures summary and remediations

As of commit 1eaef33 (more details on the Dr. CI page):



❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build caffe2_onnx_main_py3_6_clang7_ubuntu16_04_build (1/1)

Step: "Build" (full log | diagnosis details | 🔁 rerun) ❄️

May 19 00:09:56 fatal: The remote end hung up unexpectedly
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/caffe2/py3.6-clang7-ubuntu16.04:376 
 
real	0m25.506s 
user	0m0.084s 
sys	0m0.040s 
May 19 00:09:17 ++ export BUILD_ENVIRONMENT=caffe2-onnx-main-py3.6-clang7-ubuntu16.04-build 
May 19 00:09:17 ++ BUILD_ENVIRONMENT=caffe2-onnx-main-py3.6-clang7-ubuntu16.04-build 
May 19 00:09:17 ++ git submodule sync 
May 19 00:09:17 ++ git submodule update -q --init --recursive 
May 19 00:09:56 error: RPC failed; curl 56 GnuTLS recv error (-54): Error in the pull function. 
May 19 00:09:56 fatal: The remote end hung up unexpectedly 
May 19 00:09:56 fatal: early EOF 
May 19 00:09:56 fatal: index-pack failed 
May 19 00:09:56 fatal: clone of 'https://github.com/protocolbuffers/protobuf.git' into submodule path 'third_party/protobuf' failed 

ci.pytorch.org: 1 failed



@mruberry
Collaborator

I'm confused about how the PR that this reverts is affecting test_nccl and ROCm. When I look at the CI, it seems like the tensors being compared in test_nccl are very different from each other. Are the tensors actually not that different?

@sunway513

If I had to make a guess, the original failure might be due to the following lines, which changed the test coverage on the ROCm path:
https://github.com/pytorch/pytorch/pull/38689/files#diff-9996665f82f52030836eb8657057cfadL6530
https://github.com/pytorch/pytorch/pull/38689/files#diff-9996665f82f52030836eb8657057cfadL5615
Or this line that changed the tolerance settings on GPUs:
https://github.com/pytorch/pytorch/pull/38689/files#diff-9996665f82f52030836eb8657057cfadR11461
cc @jithunnair-amd

@mruberry
Collaborator

If I had to make a guess, the original failure might be due to the following lines, which changed the test coverage on the ROCm path:
https://github.com/pytorch/pytorch/pull/38689/files#diff-9996665f82f52030836eb8657057cfadL6530
https://github.com/pytorch/pytorch/pull/38689/files#diff-9996665f82f52030836eb8657057cfadL5615
Or this line that changed the tolerance settings on GPUs:
https://github.com/pytorch/pytorch/pull/38689/files#diff-9996665f82f52030836eb8657057cfadR11461
cc @jithunnair-amd

All of these cited changes are in test_torch.py, though, and would have no effect on test_nccl.py.

@mruberry
Collaborator

mruberry commented May 19, 2020

Looks like the tests in question aren't being run because of another failure:

01:30:41 ======================================================================
01:30:41 ERROR: test_backend_full_group (__main__.TestDistBackend)
01:30:41 ----------------------------------------------------------------------
01:30:41 Traceback (most recent call last):
01:30:41   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 204, in wrapper
01:30:41     self._join_processes(fn)
01:30:41   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 306, in _join_processes
01:30:41     self._check_return_codes(elapsed_time)
01:30:41   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 339, in _check_return_codes
01:30:41     raise RuntimeError(error)
01:30:41 RuntimeError: Processes 1 2 exited with error code 10
01:30:41 
01:30:41 ----------------------------------------------------------------------

@jithunnair-amd
Collaborator

@mruberry Which log is the above snippet from?

@mruberry
Collaborator

@mruberry Which log is the above snippet from?

https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-test2/28334/console

It's from subtest 2 under the rocmdeb-ubuntu16.04 tests, the one that would run the NCCL tests.

@jithunnair-amd
Collaborator

Thanks. The actual reason for the failure is a memory access fault:

01:29:11 test_backend_full_group (__main__.TestDistBackend) ... Memory exception on virtual address 0x7f61a970c000, node id 4 : Page not present
01:29:11 GPU address 0x7f61a970c000, node id 0, size in byte 0x1000
01:29:11 Memory is allocated using hsaKmtAllocMemory
01:29:11 CPU address of the memory is 0x7f61a970c000
01:29:11 Memory is mapped to node id: 4 5 6 7 
01:29:11 Memory access fault by GPU node-4 (Agent handle: 0x3c0fa60) on address 0x7f61a970c000. Reason: Page not present or supervisor privilege.

@jeffdaily This is most likely the same as the known issue being debugged regarding multi-GPU memory access faults on ROCm.
@mruberry I'm attempting to reproduce the test_nccl failure at 9cfc10d. I'll get back with some details.

@mruberry
Collaborator

@mruberry I'm attempting to reproduce the test_nccl failure at 9cfc10d. I'll get back with some details.

Cool, when you run the test would you also post the tensor values that are being compared? I'm curious if they actually are different or not.

@jithunnair-amd
Collaborator

jithunnair-amd commented May 19, 2020

I've been able to reproduce the assertion error at commit 9cfc10d. However, it is only reproducible when running the test_nccl suite as a whole, not when running the same tests individually. This points to a potential issue with RCCL buffers or some other synchronization issue.

[EDIT] I rebuilt PyTorch at the commit just before 9cfc10d and couldn't reproduce the error. That would seem to point to something in 9cfc10d causing this, but I haven't gotten far enough to identify exactly what it is.
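
For reference, here is a minimal sketch of the two invocation patterns that behave differently here, assuming test_nccl.py is a standard unittest module importable from PyTorch's test/ directory; the individual test id below is illustrative, not taken from the suite:

```python
# Minimal sketch: whole-suite vs. single-test invocation of test_nccl.
# Assumes it is run from PyTorch's test/ directory so that test_nccl is
# importable; the individual test id is illustrative.
import unittest

loader = unittest.defaultTestLoader

# Whole-suite run (the failing case on ROCm): every TestNCCL test executes
# in the same process, sharing whatever RCCL/communicator state earlier
# tests leave behind.
suite = loader.loadTestsFromName("test_nccl")
unittest.TextTestRunner(verbosity=2).run(suite)

# Single-test run (passes): only one test is loaded, so no earlier
# collectives have run in the process before it.
single = loader.loadTestsFromName("test_nccl.TestNCCL.test_broadcast")
unittest.TextTestRunner(verbosity=2).run(single)
```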

@jeffdaily
Collaborator Author

It's important that we fully understand why this commit caused test_nccl.py to fail. Testing locally, I could reproduce the failure. Blacklisting this test allowed all remaining tests to pass.
Closing this PR in favor of #38730.

@jeffdaily jeffdaily closed this May 19, 2020
facebook-github-bot pushed a commit that referenced this pull request May 19, 2020
Summary:
CC ezyang xw285cornell sunway513

Work-around for recent ROCm CI failures due to 9cfc10d (#37294).  Replaces full revert suggested by PR #38689.
Pull Request resolved: #38730

Differential Revision: D21648707

Pulled By: xw285cornell

fbshipit-source-id: 627b11b229c7eadca1f6e0c6192c6b5b6416e6a1
@jithunnair-amd
Collaborator

@mruberry @ezyang I have narrowed down the reason why commit 9cfc10d causes the test_nccl failure: the assertion logic has now changed to comparing tensors on the CPU as opposed to the GPU when a is a CUDA tensor and b is a CPU tensor:
Post - 9cfc10d: [embedded code snippet not rendered]
Pre - 9cfc10d: [embedded code snippet not rendered]
Still debugging the root cause.
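
In rough terms, the difference being described is something like the following simplified sketch (not the actual assertEqual implementation in torch/testing/_internal/common_utils.py), assuming a is a CUDA tensor and b is a CPU tensor:

```python
# Simplified sketch of the device-handling difference described above;
# not the actual assertEqual implementation.
import torch

def compare_pre_9cfc10d(a, b):
    # Pre-9cfc10d (roughly): the CPU tensor is moved onto the GPU and the
    # comparison runs on the device.
    return torch.allclose(a, b.to(a.device))

def compare_post_9cfc10d(a, b):
    # Post-9cfc10d (roughly): the CUDA tensor is copied back to the host and
    # the comparison runs on the CPU, which implies a different
    # synchronization pattern than the device-side comparison above.
    return torch.allclose(a.cpu(), b)
```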

@mruberry
Collaborator

@jithunnair-amd that's correct. Tests (like this one?) were and still are relying on assertEqual to synchronize devices for them. This is incorrect, however. In the future we'll probably default the exact_device kwarg to True to ensure that tests perform their own cross-device synchronization or explicitly indicate they don't care about how the synchronization is performed.
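
For illustration, a minimal sketch (the helper name and structure are illustrative, not taken from the test suite) of a test making the cross-device handling explicit instead of leaning on assertEqual's implicit device moves:

```python
# Minimal sketch of a test handling cross-device comparison explicitly,
# rather than relying on assertEqual to move tensors between devices.
import torch

def check_collective_result(test_case, gpu_result, cpu_expected):
    # Option 1: bring the GPU result back to the host ourselves; the blocking
    # copy makes the device-to-host data dependence explicit in the test.
    test_case.assertEqual(gpu_result.cpu(), cpu_expected)

    # Option 2: compare on the GPU by moving the expected value there, so the
    # comparison itself exercises the device.
    test_case.assertEqual(gpu_result, cpu_expected.to(gpu_result.device))
```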

facebook-github-bot pushed a commit that referenced this pull request Jun 5, 2020
…_nccl tests (#39354)

Summary:
All individual test_nccl unit tests have been disabled for ROCm in bf93954
test_nccl was also added to the ROCM_BLACKLIST in 87b198d
However, the issue only arises when running the test_nccl suite as a whole (as opposed to any one test individually). More details in comments here: #38689

This PR enables the test_nccl suite with only two tests so as to work around the as-yet unresolved issue above, while allowing at least one test_nccl collective test to run on ROCm. This is also needed as a precursor for #38515
Pull Request resolved: #39354

Differential Revision: D21843194

Pulled By: mrshenli

fbshipit-source-id: b28d1e073d8d0fdc1b59928fc3b00187cfd02a35
