Conversation

@albanD
Collaborator

@albanD albanD commented Feb 7, 2024

Note that I'm not sure why we both have pytest rerun the failing test twice via

rerun_options = ["-x", "--reruns=2"]
before our own logic retries it as well?

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
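For readers following along, here is a minimal, hypothetical sketch of the behavior under discussion: TORCH_SHOW_CPP_STACKTRACES is a real PyTorch env var, but the snippet below is illustrative only and is not taken from run_test.py or the test added in this PR.

```python
import os

# Illustrative only: setting TORCH_SHOW_CPP_STACKTRACES before importing torch is
# the safe way to make error messages from failing C++ checks include C++ frames.
os.environ["TORCH_SHOW_CPP_STACKTRACES"] = "1"

import torch

try:
    # A deliberately invalid view to trigger a RuntimeError raised from C++.
    torch.empty(2).view(3)
except RuntimeError as err:
    # With the env var set, the message also carries the C++ stack trace.
    print(err)
```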

@albanD albanD requested a review from a team as a code owner February 7, 2024 20:57
@pytorch-bot

pytorch-bot bot commented Feb 7, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119408

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 86ea539 with merge base 834c7a1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing label on Feb 7, 2024
@clee2000
Contributor

clee2000 commented Feb 7, 2024

Note that I'm not sure why we both have pytest rerun the failing test twice via

The --reruns=2 reruns happen in the same process; the later reruns are in a different process -> 3 tries per process x 3 processes = 9 tries total.

The same-process reruns are probably a bit overkill, but it's faster to rerun a test in the same process and hope it's just flaky, while the new-process reruns are useful for segfaults.

Also, there's a small added benefit that having both can help distinguish certain types of flakiness.
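To make the retry math above concrete, here is a rough, hypothetical sketch of how in-process and new-process retries can combine. It is not the actual run_test.py logic; the function and constants are made up, and it assumes pytest plus the pytest-rerunfailures plugin are installed.

```python
import subprocess
import sys

PROCESS_ATTEMPTS = 3    # fresh-process attempts catch hard crashes (e.g. segfaults)
IN_PROCESS_RERUNS = 2   # pytest-rerunfailures retries within the same process

def run_with_retries(test_file: str) -> bool:
    """Worst case: PROCESS_ATTEMPTS * (1 + IN_PROCESS_RERUNS) = 9 tries total."""
    for attempt in range(PROCESS_ATTEMPTS):
        # "-x" stops at the first failure; "--reruns" retries it in the same process.
        ret = subprocess.run(
            [sys.executable, "-m", "pytest", test_file,
             "-x", f"--reruns={IN_PROCESS_RERUNS}"],
        ).returncode
        if ret == 0:
            return True
        print(f"process attempt {attempt + 1} failed (exit code {ret}), retrying in a new process")
    return False
```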

Contributor

@huydhn huydhn left a comment

LGTM!

@albanD
Collaborator Author

albanD commented Feb 7, 2024

Tests are showing the C++ stack traces as expected for the last 6 retries.

@albanD
Collaborator Author

albanD commented Feb 7, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Feb 7, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: trunk / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-12)

Details for Dev Infra team: raised by workflow job

@albanD
Collaborator Author

albanD commented Feb 8, 2024

@pytorchbot merge -i

All failures are unrelated

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 3 checks: pull / linux-jammy-py3.10-clang15-asan / test (default, 1, 6, linux.4xlarge), Lint / lintrunner / linux-job, trunk / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-12)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@malfet
Contributor

malfet commented Feb 8, 2024

@pytorchbot revert -c "Looks like it introduced intermittent crashes, testing the theory" -c weird

@pytorch-bot

pytorch-bot bot commented Feb 8, 2024

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: argument -c/--classification: invalid choice: 'Looks like it introduced intermittent crashes, testing the theory' (choose from 'nosignal', 'ignoredsignal', 'landrace', 'weird', 'ghfirst')

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@malfet
Contributor

malfet commented Feb 8, 2024

@pytorchbot revert -m "Looks like it introduced intermittent crashes see https://github.com/pytorch/pytorch/actions/runs/7823402867/job/21344456540 for example, testing the theory" -c weird

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Feb 8, 2024
This reverts commit fbe6f62.

Reverted #119408 on behalf of https://github.com/malfet due to Looks like it introduced intermittent crashes see https://github.com/pytorch/pytorch/actions/runs/7823402867/job/21344456540 for example, testing the theory (comment)
@pytorchmergebot
Collaborator

@albanD your PR has been successfully reverted.

pytorch-bot bot pushed a commit that referenced this pull request Feb 8, 2024
Note that I'm not sure why we both have pytest rerun the failing test twice via https://github.com/pytorch/pytorch/blob/81abc2b2494ab7d48394b63d528eb5dddfa9d3d5/test/run_test.py#L966 before our own logic retries it as well?

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: #119408
Approved by: https://github.com/huydhn
pytorch-bot bot pushed a commit that referenced this pull request Feb 8, 2024
This reverts commit fbe6f62.

Reverted #119408 on behalf of https://github.com/malfet due to Looks like it introduced intermittent crashes see https://github.com/pytorch/pytorch/actions/runs/7823402867/job/21344456540 for example, testing the theory (comment)
@clee2000
Contributor

@pytorchbot rebase -b main

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/119408/head returned non-zero exit code 1

Rebasing (1/4)
Auto-merging test/run_test.py
CONFLICT (content): Merge conflict in test/run_test.py
Auto-merging test/test_autograd.py
error: could not apply 1c506117750... Add cpp stack traces to our own reruns
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply 1c506117750... Add cpp stack traces to our own reruns

Raised by https://github.com/pytorch/pytorch/actions/runs/7982240840

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased test_retry onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout test_retry && git pull --rebase)

pytorchmergebot pushed a commit that referenced this pull request Feb 21, 2024
This PR removes and adds some failures and successes that were hidden in the past week (ish).

#119408 (47182a8) accidentally removed environment variables on rerun (see PR body of #120251 for slightly more details).

Testing with dynamo is enabled via an env var, so if a test failed with dynamo, it would rerun without the dynamo env var set, making it pass on retry. Normally the flaky test bot would catch this and make an issue for the test, but the CI env var controls whether XML test reports get made, and that also got removed on rerun, so the XMLs weren't made either.

Pull Request resolved: #120271
Approved by: https://github.com/DanilBaibak, https://github.com/zou3519
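For illustration, a small hypothetical sketch of the failure mode described in the commit message above: if the rerun subprocess gets a rebuilt environment instead of a copy of the parent's, opt-in variables such as PYTORCH_TEST_WITH_DYNAMO and CI silently disappear on retry, changing what the retry actually tests and whether XML reports are written. Nothing below is taken from run_test.py.

```python
import os
import subprocess
import sys

def rerun_in_new_process(test_module: str) -> int:
    """Hypothetical rerun helper, illustrative only."""
    # Buggy pattern: rebuilding the env from scratch drops variables such as
    # PYTORCH_TEST_WITH_DYNAMO or CI, so the retry runs under different settings.
    # env = {"PATH": os.environ["PATH"]}

    # Safer pattern: start from a full copy of the parent environment and only
    # add or override what the retry actually needs.
    env = os.environ.copy()
    env["TORCH_SHOW_CPP_STACKTRACES"] = "1"
    return subprocess.run(
        [sys.executable, "-m", "pytest", test_module], env=env
    ).returncode
```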
@albanD
Collaborator Author

albanD commented Feb 21, 2024

Oh, sorry about that. Let me know how I can help!

@clee2000
Contributor

Oh, sorry about that. Let me know how I can help!

I merged the fix in #120251; right now I'm just waiting for CI on main to be a bit greener. This PR shouldn't need any changes, except maybe a rebase if you want to be safe.

@albanD
Collaborator Author

albanD commented Feb 21, 2024

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Tried to rebase and push PR #119408, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

@clee2000
Contributor

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased test_retry onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout test_retry && git pull --rebase)

@clee2000
Contributor

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Jun 20, 2024
So how come this PR fixes any flakiness?

Well, following my investigation (read pt 1 in the linked ghstack PR below), I realized that this test only consistently errors after another test was found flaky.

Why? Because TORCH_SHOW_CPP_STACKTRACES=1 gets turned on for _every_ test after _any_ test reruns, following this PR #119408. And yeah, this test checked for an exact error message match, which would no longer match since the stack trace for a foreach function is obviously going to be different from a non-foreach one.

So we improve the test.

Pull Request resolved: #129003
Approved by: https://github.com/soulitzer
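As a generic illustration of the fix described in the commit message above (the real test in #129003 exercises a foreach op; this sketch does not), the assertion can match a stable fragment of the error message instead of the exact full text, so a C++ stack trace appended by TORCH_SHOW_CPP_STACKTRACES=1 cannot break it.

```python
import unittest

import torch

class ErrorMessageTest(unittest.TestCase):
    def test_shape_mismatch_message(self):
        # Match only a stable fragment of the message; any C++ stack trace
        # appended by TORCH_SHOW_CPP_STACKTRACES=1 then doesn't matter.
        with self.assertRaisesRegex(RuntimeError, "must match the size of tensor"):
            torch.ones(2) + torch.ones(3)

if __name__ == "__main__":
    unittest.main()
```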
Labels

ciflow/trunk, Merged, Reverted, topic: not user facing
