
Conversation

@sraikund16 sraikund16 (Contributor) commented Aug 30, 2024

Summary:
Some jobs are currently encountering the trace in P1539415198, which suggests that the tensor-parsing path is prone to hitting an invalid address. This is possibly occurring because, for some tensors, sizes() and strides() do not have the same number of dimensions, even though we assume they do when iterating through the shapes to build the IValue generator. While browsing some of the tensor implementations, I found that some of the size and stride paths differ, which could be the cause of this issue. Regardless, the profiler should be flexible enough to handle such cases without bringing down the whole main thread.

If the crashes still persist, this change will at least give us a data point on where they are occurring, and we can rule out the sizes/strides mismatch as the culprit.

Test Plan: This change doesn't affect the happy path; it just makes sure the bad path doesn't exit abruptly. We should use it to debug which events have mismatching dimensions between sizes and strides.

Differential Revision: D62008788

cc @robieta @chaekit @aaronenyeshi @guotuofeng @guyang3532 @dzhulgakov @davidberard98 @briancoutinho @sanrise
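
For illustration, a minimal C++ sketch of the kind of defensive check described above. This is not the exact diff; the helper name recordTensorShape and the surrounding structure are hypothetical, but the guard mirrors the stated intent of the change (warn and skip rather than crash).

// Hypothetical sketch: instead of assuming sizes() and strides() share the
// same rank, warn and bail out so the profiler thread keeps running.
#include <ATen/ATen.h>
#include <c10/util/Exception.h>

static void recordTensorShape(const at::Tensor& t) {
  const auto sizes = t.sizes();
  const auto strides = t.strides();
  if (sizes.size() != strides.size()) {
    // Log enough to identify the offending op, then exit the bad path gracefully.
    TORCH_WARN_ONCE(
        "Tensor has mismatching sizes and strides. Sizes: ",
        sizes,
        " Strides: ",
        strides);
    return;
  }
  for (size_t i = 0; i < sizes.size(); ++i) {
    // ... record (sizes[i], strides[i]) for the shape/IValue metadata ...
  }
}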

pytorch-bot bot commented Aug 30, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134862

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (26 Unrelated Failures)

As of commit de48ffb with merge base 1b9f51b:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D62008788

@sraikund16 sraikund16 added the topic: not user facing and oncall: profiler labels on Aug 30, 2024
@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D62008788

sraikund16 added a commit to sraikund16/pytorch that referenced this pull request Aug 30, 2024
@sraikund16 sraikund16 added the ciflow/trunk label on Aug 30, 2024
@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D62008788

sraikund16 added a commit to sraikund16/pytorch that referenced this pull request Aug 30, 2024
@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D62008788

sraikund16 added a commit to sraikund16/pytorch that referenced this pull request Aug 30, 2024
@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D62008788

sraikund16 added a commit to sraikund16/pytorch that referenced this pull request Aug 30, 2024
@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D62008788

sraikund16 added a commit to sraikund16/pytorch that referenced this pull request Aug 30, 2024
@aaronenyeshi aaronenyeshi (Member) left a comment

Looks great! Added a few nits

"Tensor has mismatching sizes and strides. Sizes: ",
t.sizes(),
" Strides: ",
t.strides());

did you mean to print the sizes and strides rather than their dimensions? Seems like we're checking dims here
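
For clarity, the condition being discussed compares the number of dimensions (ranks) of sizes() and strides(), while the warning text prints the full values. A small hypothetical standalone illustration (a regular tensor is used here, so the ranks match):

// Hypothetical example: the rank comparison is what the guard tests; the
// printed arrays are what the warning message shows.
#include <ATen/ATen.h>
#include <iostream>

int main() {
  at::Tensor t = at::rand({4, 8, 16});
  const bool ranks_match = t.sizes().size() == t.strides().size();
  std::cout << "ranks match: " << ranks_match
            << ", Sizes: " << t.sizes()
            << ", Strides: " << t.strides() << std::endl;
  return 0;
}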

@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D62008788

sraikund16 added a commit to sraikund16/pytorch that referenced this pull request Sep 3, 2024
@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D62008788

@facebook-github-bot (Contributor):

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator):

Merge failed

Reason: 15 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@kit1980 kit1980 (Contributor) commented Sep 3, 2024

@pytorchbot merge -f "Merged internally, the errors are unrelated No matching distribution found for numpy==2.0.0rc1"

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

tolleybot pushed a commit to tolleybot/pytorch that referenced this pull request Sep 14, 2024
Pull Request resolved: pytorch#134862
Approved by: https://github.com/aaronenyeshi
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
fb-exported
Merged
oncall: profiler (profiler-related issues: cpu, gpu, kineto)
topic: not user facing (topic category)
