Fix flakiness with test_binary_op_list_error_cases #129003
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129003
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: There is 1 currently active SEV. If your PR is affected, please view it below.
✅ You can merge normally! (1 Unrelated Failure) As of commit 8f37b6d with merge base 8c25426:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Nice catch!
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 1 check: pull / linux-jammy-py3.8-gcc11 / test (distributed, 1, 2, linux.2xlarge). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 2 jobs have failed; the first few of them are: trunk / libtorch-linux-focal-cuda12.4-py3.7-gcc9-debug / build, trunk / libtorch-linux-focal-cuda12.1-py3.7-gcc9-debug / build. Details for Dev Infra team: Raised by workflow job.
@pytorchbot merge -r
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 6 mandatory check(s) failed. The first few are:
Dig deeper by viewing the failures on hud.
@pytorchbot merge -r
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here
Successfully rebased
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Reenable foreach tests on non-sm86 machines. I believe I've fixed the flakes caused when TORCH_SHOW_CPP_STACKTRACES=1 is set, though I know @clee2000 had also just landed #129004 for the same effect. Regardless, this makes the foreach tests more robust against future disruptions. The fix is similar in flavor to #129003.
Pull Request resolved: #130277
Approved by: https://github.com/soulitzer
So how does this PR fix any flakiness?
Well, following my investigation (read pt 1 in the linked ghstack PR below), I realized that this test only consistently errors after another test has been found flaky.
Why? Because TORCH_SHOW_CPP_STACKTRACES=1 gets turned on for every test after any test reruns, following #119408. And this test checked for an exact error message match, which would no longer match, since the stacktrace for a foreach function is obviously going to differ from that of a non-foreach one.
So we improve the test.
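To make that concrete, here is a minimal sketch of the matching strategy, assuming a mismatched-length `torch._foreach_add` call and an illustrative error message; neither is copied from the actual PR diff. `assertRaisesRegex` searches the raised message rather than comparing it in full, so anchoring on a stable prefix keeps the test green even when TORCH_SHOW_CPP_STACKTRACES=1 appends a C++ stacktrace.

```python
import re
import unittest

import torch


class ForeachErrorMessageSketch(unittest.TestCase):
    def test_binary_op_list_error_cases(self):
        # Mismatched list lengths should make the foreach op raise.
        tensors1 = [torch.zeros(3)]
        tensors2 = [torch.zeros(3), torch.zeros(3)]

        # Brittle, pre-fix flavor: comparing the full message breaks once
        # TORCH_SHOW_CPP_STACKTRACES=1 appends a C++ stacktrace to it.
        #
        # Robust, post-fix flavor: match only a stable prefix. The exact
        # wording below is an assumption for illustration, not the real
        # message asserted in the PR.
        with self.assertRaisesRegex(
            RuntimeError,
            re.escape("Tensor lists must have the same number of tensors"),
        ):
            torch._foreach_add(tensors1, tensors2)


if __name__ == "__main__":
    unittest.main()
```

The same idea generalizes across the foreach suite: assert only on the portion of the message the op itself controls, and treat anything the runtime may append as noise.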
Stack from ghstack (oldest at bottom):