Add cpp stack traces to our own reruns #119408
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119408
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 86ea539 with merge base 834c7a1.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
The --reruns=2 reruns happen in the same process; the later reruns are in a different process -> 3 tries per process x 3 processes = 9 tries total.
The same-process reruns are probably a bit overkill, but it's faster to rerun a test in the same process and hope it's just flaky, and the new-process reruns are useful for segfaults.
There's also a small added benefit that having both can help distinguish certain types of flakiness.
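As a rough illustration of the retry layering above (a minimal sketch under assumptions, not the actual run_test.py logic; the wrapper function and constant names here are hypothetical): the in-process tries come from pytest's --reruns flag (pytest-rerunfailures), while the process-level retries come from re-launching pytest in a fresh process, with TORCH_SHOW_CPP_STACKTRACES=1 enabled on the retries so crashes come with C++ stack traces (how exactly that flag is scoped differs in the real harness).

```python
# Hypothetical sketch of the two retry layers described above.
import os
import subprocess
import sys

IN_PROCESS_RERUNS = 2   # --reruns=2 -> up to 3 tries inside one process
MAX_PROCESSES = 3       # up to 3 separate processes -> 9 tries total

def run_with_retries(test_file: str) -> bool:
    for attempt in range(MAX_PROCESSES):
        env = dict(os.environ)
        if attempt > 0:
            # Retried processes get C++ stack traces to make crashes debuggable.
            env["TORCH_SHOW_CPP_STACKTRACES"] = "1"
        # pytest-rerunfailures handles the cheap in-process retries; a segfault
        # only takes down the current process, not the whole retry loop.
        result = subprocess.run(
            [sys.executable, "-m", "pytest", test_file,
             f"--reruns={IN_PROCESS_RERUNS}"],
            env=env,
        )
        if result.returncode == 0:
            return True
    return False
```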
huydhn left a comment
LGTM!
Tests are showing the C++ stack traces as expected for the last 6 retries.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few are: trunk / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-12). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -i
All failures are unrelated.
Merge started. Your change will be merged while ignoring the following 3 checks: pull / linux-jammy-py3.10-clang15-asan / test (default, 1, 6, linux.4xlarge), Lint / lintrunner / linux-job, trunk / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-12). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -c "Looks like it introduced intermittent crashes, testing the theory" -c weird
❌ 🤖 pytorchbot command failed: Try …
@pytorchbot revert -m "Looks like it introduced intermittent crashes see https://github.com/pytorch/pytorch/actions/runs/7823402867/job/21344456540 for example, testing the theory" -c weird
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit fbe6f62. Reverted #119408 on behalf of https://github.com/malfet due to: Looks like it introduced intermittent crashes, see https://github.com/pytorch/pytorch/actions/runs/7823402867/job/21344456540 for example, testing the theory (comment on #119408).
@albanD your PR has been successfully reverted.
Note that I'm not sure why we both have pytest rerun the failing test twice via https://github.com/pytorch/pytorch/blob/81abc2b2494ab7d48394b63d528eb5dddfa9d3d5/test/run_test.py#L966 and have our own logic retry it as well.
The failing test is only here to make sure it works as expected in the CI env; it will be removed before landing.
Pull Request resolved: #119408
Approved by: https://github.com/huydhn
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Rebase failed due to Command … Raised by https://github.com/pytorch/pytorch/actions/runs/7982240840
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here.
Successfully rebased 95c8ea4 to b70d6d5.
This PR removes and adds some failures and successes that were hidden in the past week (ish).
#119408 (47182a8) accidentally removed environment variables on rerun (see the PR body of #120251 for slightly more details). Enabling testing with dynamo is set using an env var, so if a test failed with dynamo, it would rerun without the dynamo env var set, making it pass on retry. Normally, the flaky test bot would catch this and make an issue for the test, but the CI env var controls whether or not XML test reports get made, and that also got removed on rerun, so the XMLs weren't made either.
Pull Request resolved: #120271
Approved by: https://github.com/DanilBaibak, https://github.com/zou3519
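For illustration, here is a minimal, hypothetical sketch of that bug class (the function names and the choice of which variables survive are assumptions, not the actual run_test.py code): if the rerun subprocess is launched with a hand-built environment instead of a copy of os.environ, flags such as the dynamo test-mode variable and CI silently disappear on retry, so the retried test runs under different settings and no XML report is produced.

```python
import os
import subprocess
import sys

def rerun_test_buggy(test_file: str) -> int:
    # BUG (hypothetical): only PATH survives, so variables like
    # PYTORCH_TEST_WITH_DYNAMO and CI are dropped on the rerun.
    env = {"PATH": os.environ.get("PATH", "")}
    return subprocess.run(
        [sys.executable, "-m", "pytest", test_file], env=env
    ).returncode

def rerun_test_fixed(test_file: str) -> int:
    # FIX: start from a copy of the parent environment so test-mode and
    # reporting flags propagate to the rerun process.
    env = dict(os.environ)
    return subprocess.run(
        [sys.executable, "-m", "pytest", test_file], env=env
    ).returncode
```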
Oh, sorry about that. Let me know how I can help!
I merged the fix in #120251; right now I'm just waiting for CI on main to be a bit greener. The PR shouldn't need any changes, except maybe a rebase if you want to be safe.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased b70d6d5 to 86ea539.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
So how come this PR fixes any flakiness? Following my investigation (read part 1 in the linked ghstack PR below), I realized that this test only consistently errors after another test was found flaky. Why? Because TORCH_SHOW_CPP_STACKTRACES=1 gets turned on for _every_ test after _any_ test reruns, following PR #119408. This test checked for exact error message matching, which no longer matches, since the stack trace for a foreach function is obviously going to be different from that of a non-foreach one. So we improve the test.
Pull Request resolved: #129003
Approved by: https://github.com/soulitzer
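To illustrate the kind of test change being described (a hypothetical sketch, not the actual PyTorch test; the error message and helper are invented), the idea is to match only the stable prefix of the error message rather than the full text, so the assertion still holds when TORCH_SHOW_CPP_STACKTRACES=1 causes a C++ stack trace to be appended:

```python
import unittest

def op_that_fails(show_cpp_stacktrace: bool) -> None:
    # Stand-in for a torch op; the real message would come from ATen.
    msg = "expected a non-empty list of Tensors"
    if show_cpp_stacktrace:
        msg += "\nException raised from ... (C++ stack trace follows)"
    raise RuntimeError(msg)

class ErrorMessageTest(unittest.TestCase):
    def test_exact_match_is_fragile(self):
        # Exact comparison breaks as soon as a stack trace is appended.
        with self.assertRaises(RuntimeError) as cm:
            op_that_fails(show_cpp_stacktrace=True)
        self.assertNotEqual(str(cm.exception), "expected a non-empty list of Tensors")

    def test_prefix_match_is_robust(self):
        # assertRaisesRegex only requires the pattern to appear in the message,
        # so an appended C++ stack trace does not break the assertion.
        with self.assertRaisesRegex(RuntimeError, "expected a non-empty list of Tensors"):
            op_that_fails(show_cpp_stacktrace=True)

if __name__ == "__main__":
    unittest.main()
```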