@janeyx99
Contributor

@janeyx99 janeyx99 commented Jun 18, 2024

So how come this PR fixes any flakiness?

Following my investigation (see part 1 in the linked ghstack PR below), I realized that this test only errors consistently after another test has been found flaky.

Why? Because TORCH_SHOW_CPP_STACKTRACES=1 gets turned on for every test once any test is rerun, since PR #119408. And yea, this test checked for exact error message matching, which no longer would match since the stacktrace for a foreach function is obviously going to be different from a nonforeach.

So we improve the test.
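
To make the failure mode concrete, here is a minimal sketch of the idea, not the actual diff in this PR. It assumes the stack trace is appended to the error message after the first line; the helper names (`first_line`, `same_error`) and the simulated messages are hypothetical and only for illustration. No torch import is needed to run it.

```python
def first_line(message: str) -> str:
    """Return only the first line of an error message.

    Assumption: any appended stack trace starts on a later line,
    so the first line carries the human-readable error.
    """
    lines = message.splitlines()
    return lines[0] if lines else ""


def same_error(e1: BaseException, e2: BaseException) -> bool:
    """Compare two exceptions by type and first message line only."""
    return type(e1) is type(e2) and first_line(str(e1)) == first_line(str(e2))


# Simulated messages (illustrative text, not real PyTorch output).
ref_error = RuntimeError("result type Float can't be cast to Long")
foreach_error = RuntimeError(
    "result type Float can't be cast to Long\n"
    "Exception raised from some_foreach_kernel (most recent call first):\n"
    "frame #0: <simulated frame>"
)

# Exact matching breaks once a trace is appended to one of the messages...
exact_match = str(ref_error) == str(foreach_error)
# ...while comparing only the first line still treats them as the same error.
robust_match = same_error(ref_error, foreach_error)
```

With this kind of comparison, the test no longer depends on whether stack traces are enabled for the run.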

Stack from ghstack (oldest at bottom):

@pytorch-bot

pytorch-bot bot commented Jun 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129003

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (1 Unrelated Failure)

As of commit 8f37b6d with merge base 8c25426 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jun 18, 2024
janeyx99 added a commit that referenced this pull request Jun 18, 2024
ghstack-source-id: bce105b
Pull Request resolved: #129003
@soulitzer
Contributor

And yea, this test checked for exact error message matching, which no longer would match since the stacktrace for a foreach function is obviously going to be different from a nonforeach

Nice catch!

@janeyx99
Contributor Author

@pytorchbot merge -i

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 20, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: pull / linux-jammy-py3.8-gcc11 / test (distributed, 1, 2, linux.2xlarge)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 jobs have failed, first few of them are: trunk / libtorch-linux-focal-cuda12.4-py3.7-gcc9-debug / build, trunk / libtorch-linux-focal-cuda12.1-py3.7-gcc9-debug / build

Details for Dev Infra team Raised by workflow job

@janeyx99
Contributor Author

@pytorchbot merge -r

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Successfully rebased gh/janeyx99/174/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/129003)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@pytorchmergebot
Collaborator

Merge failed

Reason: 6 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@janeyx99
Contributor Author

@pytorchbot merge -r

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Tried to rebase and push PR #129003, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.

@janeyx99
Contributor Author

@pytorchbot rebase -b main

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Successfully rebased gh/janeyx99/174/orig onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/129003)

pytorchmergebot pushed a commit that referenced this pull request Jun 20, 2024
ghstack-source-id: beba2e3
Pull Request resolved: #129003
@janeyx99
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


janeyx99 added a commit that referenced this pull request Jul 8, 2024
pytorchmergebot pushed a commit that referenced this pull request Jul 9, 2024
Reenable foreach tests on non-sm86 machines. I believe I've fixed the flakes that are caused when TORCH_SHOW_CPP_STACKTRACES=1, though I know @clee2000 had also just landed #129004 for the same effect.

Regardless, this makes the foreach tests more robust against future disruptions anyway. Fix similar in flavor to #129003

Pull Request resolved: #130277
Approved by: https://github.com/soulitzer
@github-actions github-actions bot deleted the gh/janeyx99/174/head branch July 21, 2024 02:01