Add cpp stack traces to our own reruns #119408
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119408
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 86ea539 with merge base 834c7a1.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
The --reruns=2 reruns happen in the same process; the later reruns are in a different process -> 3 tries per process x 3 processes = 9 tries total.
The same-process reruns are probably a bit overkill, but it's faster to rerun a test in the same process and hope it's just flaky, and the new-process reruns are useful for segfaults.
There's also a small added benefit that having both can help distinguish certain types of flakiness.
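As a rough illustration of the retry layering above (a minimal sketch under assumptions, not the actual run_test.py logic; the wrapper function and constant names here are hypothetical): the in-process tries come from pytest's --reruns flag (pytest-rerunfailures), while the process-level retries come from re-launching pytest in a fresh process, with TORCH_SHOW_CPP_STACKTRACES=1 enabled on the retries so crashes come with C++ stack traces (how exactly that flag is scoped differs in the real harness).

```python
# Hypothetical sketch of the two retry layers described above.
import os
import subprocess
import sys

IN_PROCESS_RERUNS = 2   # --reruns=2 -> up to 3 tries inside one process
MAX_PROCESSES = 3       # up to 3 separate processes -> 9 tries total

def run_with_retries(test_file: str) -> bool:
    for attempt in range(MAX_PROCESSES):
        env = dict(os.environ)
        if attempt > 0:
            # Retried processes get C++ stack traces to make crashes debuggable.
            env["TORCH_SHOW_CPP_STACKTRACES"] = "1"
        # pytest-rerunfailures handles the cheap in-process retries; a segfault
        # only takes down the current process, not the whole retry loop.
        result = subprocess.run(
            [sys.executable, "-m", "pytest", test_file,
             f"--reruns={IN_PROCESS_RERUNS}"],
            env=env,
        )
        if result.returncode == 0:
            return True
    return False
```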
huydhn left a comment
LGTM!
Tests are showing the C++ stack traces as expected for the last 6 retries.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few are: trunk / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-12). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -i
All failures are unrelated.
Merge started. Your change will be merged while ignoring the following 3 checks: pull / linux-jammy-py3.10-clang15-asan / test (default, 1, 6, linux.4xlarge), Lint / lintrunner / linux-job, trunk / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-12). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -c "Looks like it introduced intermittent crashes, testing the theory" -c weird
❌ 🤖 pytorchbot command failed: Try …
@pytorchbot revert -m "Looks like it introduced intermittent crashes see https://github.com/pytorch/pytorch/actions/runs/7823402867/job/21344456540 for example, testing the theory" -c weird
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit fbe6f62. Reverted #119408 on behalf of https://github.com/malfet due to: Looks like it introduced intermittent crashes, see https://github.com/pytorch/pytorch/actions/runs/7823402867/job/21344456540 for example, testing the theory (comment on #119408).
@albanD your PR has been successfully reverted.
Note that I'm not sure why we both have pytest rerun the failing test twice via https://github.com/pytorch/pytorch/blob/81abc2b2494ab7d48394b63d528eb5dddfa9d3d5/test/run_test.py#L966 and have our own logic retry it as well.
The failing test is only here to make sure it works as expected in the CI env; it will be removed before landing.
Pull Request resolved: #119408
Approved by: https://github.com/huydhn
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Rebase failed due to Command … Raised by https://github.com/pytorch/pytorch/actions/runs/7982240840
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here.
Successfully rebased 95c8ea4 to b70d6d5.
This PR removes and adds some failures and successes that were hidden in the past week (ish).
#119408 (47182a8) accidentally removed environment variables on rerun (see the PR body of #120251 for slightly more details). Enabling testing with dynamo is set using an env var, so if a test failed with dynamo, it would rerun without the dynamo env var set, making it pass on retry. Normally, the flaky test bot would catch this and make an issue for the test, but the CI env var controls whether or not XML test reports get made, and that also got removed on rerun, so the XMLs weren't made either.
Pull Request resolved: #120271
Approved by: https://github.com/DanilBaibak, https://github.com/zou3519
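For illustration, here is a minimal, hypothetical sketch of that bug class (the function names and the choice of which variables survive are assumptions, not the actual run_test.py code): if the rerun subprocess is launched with a hand-built environment instead of a copy of os.environ, flags such as the dynamo test-mode variable and CI silently disappear on retry, so the retried test runs under different settings and no XML report is produced.

```python
import os
import subprocess
import sys

def rerun_test_buggy(test_file: str) -> int:
    # BUG (hypothetical): only PATH survives, so variables like
    # PYTORCH_TEST_WITH_DYNAMO and CI are dropped on the rerun.
    env = {"PATH": os.environ.get("PATH", "")}
    return subprocess.run(
        [sys.executable, "-m", "pytest", test_file], env=env
    ).returncode

def rerun_test_fixed(test_file: str) -> int:
    # FIX: start from a copy of the parent environment so test-mode and
    # reporting flags propagate to the rerun process.
    env = dict(os.environ)
    return subprocess.run(
        [sys.executable, "-m", "pytest", test_file], env=env
    ).returncode
```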
Oh, sorry about that. Let me know how I can help!
I merged the fix in #120251; right now I'm just waiting for CI on main to be a bit greener. The PR shouldn't need any changes, except maybe a rebase if you want to be safe.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased b70d6d5 to 86ea539.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
So how come this PR fixes any flakiness? Following my investigation (read part 1 in the linked ghstack PR below), I realized that this test only consistently errors after another test was found flaky. Why? Because TORCH_SHOW_CPP_STACKTRACES=1 gets turned on for _every_ test after _any_ test reruns, following PR #119408. This test checked for exact error message matching, which no longer matches, since the stack trace for a foreach function is obviously going to be different from that of a non-foreach one. So we improve the test.
Pull Request resolved: #129003
Approved by: https://github.com/soulitzer
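To illustrate the kind of test change being described (a hypothetical sketch, not the actual PyTorch test; the error message and helper are invented), the idea is to match only the stable prefix of the error message rather than the full text, so the assertion still holds when TORCH_SHOW_CPP_STACKTRACES=1 causes a C++ stack trace to be appended:

```python
import unittest

def op_that_fails(show_cpp_stacktrace: bool) -> None:
    # Stand-in for a torch op; the real message would come from ATen.
    msg = "expected a non-empty list of Tensors"
    if show_cpp_stacktrace:
        msg += "\nException raised from ... (C++ stack trace follows)"
    raise RuntimeError(msg)

class ErrorMessageTest(unittest.TestCase):
    def test_exact_match_is_fragile(self):
        # Exact comparison breaks as soon as a stack trace is appended.
        with self.assertRaises(RuntimeError) as cm:
            op_that_fails(show_cpp_stacktrace=True)
        self.assertNotEqual(str(cm.exception), "expected a non-empty list of Tensors")

    def test_prefix_match_is_robust(self):
        # assertRaisesRegex only requires the pattern to appear in the message,
        # so an appended C++ stack trace does not break the assertion.
        with self.assertRaisesRegex(RuntimeError, "expected a non-empty list of Tensors"):
            op_that_fails(show_cpp_stacktrace=True)

if __name__ == "__main__":
    unittest.main()
```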