Introduce HOP for inductor compiled regions to allow torch dispatch (inductor_compiled_code) #167844

jamesjwu wants to merge 7 commits into gh/jamesjwu/207/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167844. Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures) As of commit 1efd6c8 with merge base 0b3bdb0:

BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
torch/_inductor/config.py (Outdated)

```python
# Wrap compiled regions in inductor_compiled_code HOP to make them visible to
# TorchDispatchModes like DebugMode and Selective Activation Checkpointing.
# This avoids runtime overhead of checking dispatch modes at every call.
```
Err, this comment is weird. When this config is on you are checking dispatch mode every call, no?
torch/_inductor/output_code.py (Outdated)

```python
original_callable = self.current_callable

def wrapped_callable(inputs):
    return inductor_compiled_code(original_callable, inputs)
```
I think my old strategy was good (explicitly testing if there's a mode on) and you should do it. The HOP dispatch is quite slow and I want to be moving us towards having this code on by default.
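For illustration, a minimal sketch of that strategy, assuming the internal `torch._C._len_torch_dispatch_stack()` counter is the check used; `original_callable` and `inductor_compiled_code` are the names from the diff above, and the branch itself is not in this PR:

```python
# Sketch only: skip HOP dispatch entirely when no TorchDispatchMode is active.
# `_len_torch_dispatch_stack` is an internal API and an assumption here.
import torch

def wrapped_callable(inputs):
    if torch._C._len_torch_dispatch_stack() > 0:
        # A mode (e.g. SAC, DebugMode) is active: route through the HOP so the
        # whole compiled region is visible to it as a single call.
        return inductor_compiled_code(original_callable, inputs)
    # Fast path: no mode active, call the compiled code directly.
    return original_callable(inputs)
```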
```python
self._boxed_call = True

# Store whether to wrap compiled regions in inductor_compiled_code HOP
# This is set at compile time to avoid runtime overhead
```
Uhh, sure, but this saving is dwarfed by the fact that you're always calling into the HOP now
```python
inductor_compiled_code = InductorCompiledCode()
inductor_compiled_code.fallthrough(DispatchKey.AutogradCPU)
inductor_compiled_code.fallthrough(DispatchKey.AutogradCUDA)
```
@patrick-toulme do we need to add MTIA here too?
For the love of god someone please make DispatchKey.Autograd here work LOL
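One possible shape for that, sketched under the assumption that `DispatchKey.AutogradMTIA` is the relevant key for the MTIA question above (not something this PR does):

```python
# Sketch: register the fallthrough over a list of per-backend autograd keys
# instead of one call per backend. AutogradMTIA's inclusion is an assumption.
from torch._C import DispatchKey

for key in (
    DispatchKey.AutogradCPU,
    DispatchKey.AutogradCUDA,
    DispatchKey.AutogradMTIA,
):
    inductor_compiled_code.fallthrough(key)
```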
```python
# Use config.patch to enable wrapping at inductor level
with inductor_config.patch({"wrap_inductor_compiled_regions": True}):
    compiled_fn = torch.compile(
```
This test feels insufficient. I specifically am looking for a test where we SAC around a compiled region, but in every single one of these tests it seems you are still compiling around the SAC.
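For concreteness, a hedged sketch of the test shape being asked for: SAC applied around a compiled region in eager mode, rather than compiling around the SAC. The HOP name under `torch.ops.higher_order` and the config name are taken from this PR's title and test; the rest is illustrative:

```python
import torch
from torch.utils.checkpoint import (
    CheckpointPolicy,
    checkpoint,
    create_selective_checkpoint_contexts,
)

def policy_fn(ctx, func, *args, **kwargs):
    # Save the compiled region as a single unit; recompute everything else.
    if func is torch.ops.higher_order.inductor_compiled_code:  # assumed HOP name
        return CheckpointPolicy.MUST_SAVE
    return CheckpointPolicy.PREFER_RECOMPUTE

compiled_fn = torch.compile(
    lambda x: torch.relu(x @ x),
    options={"wrap_inductor_compiled_regions": True},
)

x = torch.randn(8, 8, requires_grad=True)
out = checkpoint(
    compiled_fn,
    x,
    use_reentrant=False,
    context_fn=lambda: create_selective_checkpoint_contexts(policy_fn),
)
out.sum().backward()
```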
ezyang left a comment:

Stamping to unblock, but the tests seem a bit sloppy.
@pytorchbot merge

Going to address these comments in the next PR
Merge started: Your change will be merged once all checks pass (ETA 0-4 Hours).
The one thing: this hop is a singleton, and right now we can only ever annotate a region with one SAC policy, right, and not have per-graph SAC policies? So this is the likely policy:

```python
def policy_fn(fn, *args, **kwargs):
    if fn == inductor_wraps_hop:
        return MUST_SAVE
```

But do we foresee any places where a user would want to do different policies? Maybe we should pass in an fx_annotation into the op that users could match against.
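If the op did carry such an annotation, a policy might look like this sketch; the `annotation` kwarg and the string matched against are hypothetical, not an existing API:

```python
import torch
from torch.utils.checkpoint import CheckpointPolicy

def policy_fn(ctx, func, *args, **kwargs):
    if func is torch.ops.higher_order.inductor_compiled_code:  # assumed HOP name
        # Hypothetical: match on an fx_annotation passed into the op.
        if kwargs.get("annotation") == "attention_block":
            return CheckpointPolicy.MUST_SAVE
    return CheckpointPolicy.PREFER_RECOMPUTE
```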
Merge failed. Reason: 1 job has failed, first few of them are: trunk / macos-py3-arm64 / test (default, 3, 3, macos-m1-stable). Raised by workflow job.
Hmm that's a good point about multiple policies — it seems like this should be addable, let me think on it
@pytorchbot merge -i
Merge started: Your change will be merged while ignoring the following 4 checks: trunk / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, lf.linux.2xlarge, unstable), inductor / inductor-cpu-test / test (cpu_inductor_torchbench, 2, 2, linux.2xlarge.amx), inductor / inductor-cpu-test / test (dynamic_cpu_inductor_torchbench, 2, 2, linux.2xlarge.amx), inductor / inductor-test / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu).
@pytorchbot merge -f "all unnecessary errors"
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
Merge started: Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Stack from ghstack (oldest at bottom):
This is a cleaned up version of the POC at https://github.com/pytorch/pytorch/pull/167752/files
This PR adds an inductor option which you can pass into torch.compile that wraps all inductor generated code in a HOP, allowing it to be read by torch dispatches.
This hop is created in output_code.post_compile, so it's cache safe. The configuration to turn it on is part of `inductor_config`, and therefore already part of the cache key. I've added a test that shows this HOP is cache safe. Because this wrapper occurs at compile time, there should be little to no cpu overhead from creating it, besides that of actually processing the torch_dispatches themselves.
The context here is we want to be able to support compiled regions such as flex attention in eager mode, while working with other torch dispatch tracers like SAC. Will add more tests for SAC/flex attention specific things next.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela
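A minimal end-to-end usage sketch, assuming the `wrap_inductor_compiled_regions` option name from the test above; the logging mode is just an illustrative TorchDispatchMode, not part of this PR:

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class LoggingMode(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        print(func)  # the compiled region should appear as a single HOP call
        return func(*args, **(kwargs or {}))

fn = torch.compile(
    lambda x: torch.sin(x) + torch.cos(x),
    options={"wrap_inductor_compiled_regions": True},
)
fn(torch.randn(4))  # warm up / compile outside the mode

with LoggingMode():
    fn(torch.randn(4))
```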