Back out "Back out "free up dispatch key space (in C++)"" #74963
Conversation
albanD left a comment:
Yes!
Summary: Pull Request resolved: #74963

This is a re-land of D35192346 (9872a06) and D35192317 (a9216cd), which together change the internal representation of `DispatchKeySet` in pytorch core to free up the number of dispatch keys that we have available. See a more detailed description of the design in the original PR, #69633, and the background and bug analysis in the PR description below. The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.

**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**

Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (a subset of mobile), I'm not sure what's specific about Milan's builds that caused it to manifest only there. @dreiss, I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?

**The debugging experience was pretty difficult**

Debugging the Milan-specific failure was made difficult by the following:

(1) Lack of CI. The original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky and if they can produce reliable failure logs for debugging.

(2) It's difficult to get a repro. My work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space). There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from @MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/).

(3) Lack of stack traces. Most Milan failures didn't include actionable stack traces. @phding generously helped me debug by running my suggested patches locally and reporting back if there were any failures. The failing test didn't include a stack trace, though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.

Test Plan: Confirmed with @phding that the broken Milan workflow from the previous version of this diff is now passing.

Reviewed By: phding, albanD

Differential Revision: D35222806

fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
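For intuition about what "freeing up dispatch key space" buys, here is a toy sketch of the packed-representation idea from #69633 (invented names and an invented bit split; not the actual c10 layout): backend bits and functionality bits live in separate regions of one 64-bit word, so n backends and m functionalities cost roughly n + m bits instead of one key per (backend, functionality) pair.

```cpp
#include <cstdint>

// Toy model: low bits identify backends, high bits identify functionalities
// (dense, sparse, quantized, autograd, ...). The split point is invented.
struct ToyDispatchKeySet {
  static constexpr int kNumBackendBits = 16;

  uint64_t repr_ = 0;

  void addBackend(int backendIndex) {
    repr_ |= (uint64_t{1} << backendIndex);
  }
  void addFunctionality(int functionalityIndex) {
    repr_ |= (uint64_t{1} << (kNumBackendBits + functionalityIndex));
  }
  bool hasBackend(int backendIndex) const {
    return (repr_ >> backendIndex) & 1;
  }
  bool hasFunctionality(int functionalityIndex) const {
    return (repr_ >> (kNumBackendBits + functionalityIndex)) & 1;
  }
};
```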
Hey @bdhirsh.
Original commit changeset: b962de5d5eff
Original Phabricator Diff: D35192346

Back out "Back out "DispatchKeySet perf improvements""
Original commit changeset: e38081810a56
Original Phabricator Diff: D35192317

Differential Revision: [D35222806](https://our.internmc.facebook.com/intern/diff/D35222806/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D35222806/)!

ghstack-source-id: eb033a6
Pull Request resolved: pytorch/pytorch#74963
…mentations for quantized & non-quantized tensors in item

Summary: This PR is part of a series of PRs addressing #54150, related to using the dispatcher for calls to quantized backends as opposed to if/else conditionals. This particular PR separates the calls to quantized & non-quantized backends for item using the dispatcher. Simultaneous support of the CompositeImplicitAutograd and Quantized dispatch keys was made possible with #74963.

Test plan: There are numerous tests in the suite that make use of torch.Tensor.item.

```
python test/run_test.py
```

can be used for comprehensive evaluation. Alternatively, because this PR should not affect torch.Tensor.item calls on non-quantized tensors, we can specifically test on quantized tensors:

```
python test/test_quantization.py
```

ghstack-source-id: 3264c38
Pull Request resolved: #72333
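For background on the registration pattern involved, here is a minimal sketch (hypothetical `myops` namespace and placeholder kernel bodies; not the real aten registration) of routing one op to different kernels per dispatch key instead of branching on the tensor's type at the call site:

```cpp
#include <torch/library.h>
#include <ATen/ATen.h>

// Regular-tensor path (placeholder body: defer to the built-in item).
at::Scalar my_item_cpu(const at::Tensor& self) {
  return self.item();
}

// Quantized path: dequantize first, then read the value out.
at::Scalar my_item_quantized(const at::Tensor& self) {
  return self.dequantize().item();
}

TORCH_LIBRARY(myops, m) {
  m.def("item(Tensor self) -> Scalar");
}

TORCH_LIBRARY_IMPL(myops, CPU, m) {
  m.impl("item", my_item_cpu);
}

TORCH_LIBRARY_IMPL(myops, QuantizedCPU, m) {
  m.impl("item", my_item_quantized);
}
```

The dispatcher picks the kernel from the tensor's dispatch key set, so no is_quantized() if/else is needed at the call site.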
This PR is a re-land of #69633 (this is the second re-land attempt; the first one is at #72827). The original PR had a memory corruption bug that only surfaced on mobile builds.
Background: Existing Mobile Optimization
Pytorch mobile builds have an existing optimization ([here](https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382) and [here](https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214)), which works as follows:
Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).
In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.
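To make the tradeoff concrete, here is a minimal sketch of the idea (illustrative only: `KernelFunction` and `kNumDispatchKeySlots` are stand-in names, not the actual c10 internals; `C10_MOBILE` is the real mobile build macro):

```cpp
#include <array>
#include <cstddef>

// Stand-in for c10's KernelFunction; illustrative only.
using KernelFunction = void (*)();

#ifdef C10_MOBILE
// Mobile only ever dispatches to a handful of keys (cpu, quantized,
// autograd, ...), so each operator's table can be much smaller.
constexpr std::size_t kNumDispatchKeySlots = 8;
#else
constexpr std::size_t kNumDispatchKeySlots = 64;
#endif

struct OperatorEntry {
  // One function-pointer slot per runtime dispatch key. Every operator
  // carries one of these tables, so shrinking it saves space per operator.
  std::array<KernelFunction, kNumDispatchKeySlots> dispatchTable_{};
};
```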
The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined [here](https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294).
The mobile-optimization currently does not extend to this array (it wouldn't be that useful anyway, because there is only one array of fallback kernels globally, vs. a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
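For contrast, a sketch of the global fallback table (same stand-in names as above; the real member is `backendFallbackKernels_` on the `Dispatcher` singleton):

```cpp
#include <array>
#include <cstddef>

using KernelFunction = void (*)();

struct Dispatcher {
  // Exactly one of these exists per process, shared by all operators, so
  // shrinking it from 64 to 8 slots saves only ~56 pointers in total,
  // which is negligible next to the per-operator savings above.
  std::array<KernelFunction, 64> backendFallbackKernels_{};
};
```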
The Bug
The original PR made it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and it incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on [this line](https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294)).
That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
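A simplified sketch of that failure mode (hypothetical code, not the actual diff): the fallback array is sized with the mobile-shrunk slot count, but startup registration still walks the full 64-key range:

```cpp
#include <array>
#include <cstddef>

using KernelFunction = void (*)();

constexpr std::size_t kMobileSlots = 8;  // what the array was shrunk to
constexpr std::size_t kNumKeys = 64;     // what registration still assumes

std::array<KernelFunction, kMobileSlots> backendFallbackKernels_{};

void registerAllFallbacks(KernelFunction fallback) {
  for (std::size_t key = 0; key < kNumKeys; ++key) {
    // Out of bounds once key >= 8: the kernel pointer is written past the
    // end of the array, corrupting whatever sits after it in memory.
    backendFallbackKernels_[key] = fallback;
  }
}
```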
NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D35222806/)!