-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Revert "Updates assertEqual to use torch.isclose-like logic (#37294)" #38689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
💊 CI failures summary and remediationsAs of commit 1eaef33 (more details on the Dr. CI page):
❄️ 1 failure tentatively classified as flakybut reruns have not yet been triggered to confirm:
|
|
I'm confused how the PR that this reverts is affecting test_nccl and ROCm? When I look at the CI it seems like the tensors being compared in test_nccl are very different from each other. Are the tensors actually not that different? |
|
If I'd make a guess, the original failure might due to the following lines that changed the test coverage on ROCm path: |
All of these cited changes are in test_torch.py, though, and would have no effect on test_nccl.py. |
|
Looks like the tests in question aren't being run because of another failure: |
|
@mruberry Which log is the above snippet from? |
The subtest 2 under the rocmdeb-ubuntu16.04 tests, the one that would run the NCCL tests. |
|
Thanks. The actual reason for the failure is a memory access fault: @jeffdaily This is most likely same as the known issue being debugged wrt mgpu memory access faults on ROCm. |
|
I've been able to reproduce the assertion error at commit 9cfc10d. However, it is only reproducible when running the test_nccl suite as a whole, but not when running the same tests individually. This points to a potential issue with RCCL buffers or some other syncing issue. [EDIT] I did rebuild pytorch as just one commit before 9cfc10d, and I couldn't reproduce the error. That would seem to point to something in that commit causing this, but I didn't get far enough to identify what exactly it is. |
|
It's important that we fully understand why this commit caused test_nccl.py to fail. Testing locally I could reproduce. Blacklisting this test allowed all remaining tests to pass. |
|
@mruberry @ezyang I have narrowed down the reason why commit 9cfc10d causes the test_nccl failure: the assertion logic has now changed to comparing tensors on the CPU as opposed to the GPU when
Pre - 9cfc10d:
Still debugging the root cause. |
|
@jithunnair-amd that's correct. Tests (like this one?) were and still are relying on assertEqual to synchronize devices for them. This is incorrect, however. In the future we'll probably default the exact_device kwarg to True to ensure that tests perform their own cross-device synchronization or explicitly indicate they don't care about how the synchronization is performed. |
…_nccl tests (#39354) Summary: All individual test_nccl unit tests have been disabled for ROCm in bf93954 test_nccl was also added to the ROCM_BLACKLIST in 87b198d However, the issue only arises when running the test_nccl suite as a whole (as opposed to any one test individually). More details in comments here: #38689 This PR enables test_nccl suite with only two tests so as to workaround the as-yet unresolved issue above, while allowing at least one test_nccl collective test to run on ROCm. This is also needed as a precursor for: #38515 Pull Request resolved: #39354 Differential Revision: D21843194 Pulled By: mrshenli fbshipit-source-id: b28d1e073d8d0fdc1b59928fc3b00187cfd02a35
This reverts commit 9cfc10d.
This is breaking ROCm CI. Notably, TestNCCL.
@mruberry @ezyang @xw285cornell