[SR] Eliminate extra permute ops before aten::sum #74481
Conversation
This diff fixes an interesting performance issue related to `permute_copy`. We see this pattern frequently:

```
y = torch.permute(x, (0, 2, 1))
z = torch.sum(y, dim=-1)
```

With copy variants off, we get a strided output from `permute`, and we hit this (faster) kernel in `sum`: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L589

But with copy variants on, we get a contiguous output from `permute_copy`, which causes us to hit the slower reduction: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L597

The permute is actually unnecessary: we can statically rewrite the graph to the following, which ensures the fast kernel is hit even with copy variants on:

```
z = torch.sum(x, dim=1)
```

Differential Revision: [D34992319](https://our.internmc.facebook.com/intern/diff/D34992319/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments; please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D34992319/)!
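To make the equivalence concrete, here is a minimal, self-contained check (an illustration only, not code from this PR; the shapes are arbitrary) that the rewritten graph computes the same values as the original pattern:

```python
import torch

# Original pattern: permute, then reduce over the last dim.
x = torch.randn(8, 16, 32)
y = torch.permute(x, (0, 2, 1))  # shape (8, 32, 16)
z_before = torch.sum(y, dim=-1)  # reduces what was dim 1 of x

# Rewritten pattern: sum directly over dim 1 of the input, no permute needed.
z_after = torch.sum(x, dim=1)

assert torch.allclose(z_before, z_after)
print(z_before.shape)  # torch.Size([8, 32])
```

The outputs match because permuting only relabels dims: summing the permuted tensor over its last dim is summing `x` over the dim the permutation moved there, i.e. dim 1.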
💊 CI failures summary and remediations: as of commit 9b55df5, 💚 Looks good so far! There are no failures yet. 💚 (This comment was automatically generated by Dr. CI.)
Summary (as landed): Pull Request resolved: #74481. ghstack-source-id: 152003888. Reviewed By: navahgar. Differential Revision: D34992319. fbshipit-source-id: 0baf493708ee2180c899814a954d220d88ba1d4f (cherry picked from commit 797b6be).
Hey @mikeiovine.
Stack from ghstack (oldest at bottom):
- [SR] Eliminate extra permute ops before aten::sum #74481
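As a side note on why the rewrite is safe: summing over a permuted dim is the same as summing over the corresponding input dim, provided the surviving dims keep their relative order. Below is a hedged Python sketch of that check and remapping; `remap_sum_dims` is an illustrative helper, not the actual Static Runtime pass, which is written in C++ and operates on the TorchScript graph:

```python
import torch

def remap_sum_dims(perm, dims, ndim):
    """Map reduction dims on y = x.permute(perm) back to dims on x.

    Returns the remapped dims if the rewrite is valid, else None. The
    rewrite sum(y, dims) -> sum(x, remapped) is only safe when the kept
    (non-reduced) dims appear in the same relative order in x and y.
    """
    reduced = {perm[d % ndim] for d in dims}
    kept = [perm[d] for d in range(ndim) if perm[d] not in reduced]
    if kept != sorted(kept):  # the surviving dims would change order
        return None
    return sorted(reduced)

x = torch.randn(2, 3, 4)
perm = (0, 2, 1)
dims = (-1,)  # the pattern from this PR

remapped = remap_sum_dims(perm, dims, x.dim())  # [1]
assert remapped is not None
assert torch.allclose(torch.sum(x.permute(perm), dim=dims),
                      torch.sum(x, dim=remapped))
```

In the pattern above, `perm = (0, 2, 1)` with `dim=-1` reduces exactly the transposed dim, and the kept dims (0 and 2) stay in their original order, so rewriting to `torch.sum(x, dim=1)` is valid.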