Conversation

@bdhirsh
Contributor

@bdhirsh bdhirsh commented Dec 14, 2022

The problem is that for code like this
(existing test: TestInductorOpInfoCUDA.test_comprehensive_as_strided_partial_views_cuda_float64)

@torch.compile
def f(x):
    return x.as_strided(shape, stride)

base = torch.ones(20)
slice = base[5:15]  # a view with storage_offset() == 5
out = f(slice)

Inductor has some logic to clone the inputs, but it doesn't respect the storage offset of those inputs. The above example breaks because we put torch.ops.aten.as_strided(x, size, stride) into the graph, and the storage_offset is inferred from the input, which (incorrectly) comes out as 0 instead of 5. We are also forced to use

This used to work before the current refactor, because aot autograd was not passing ops whose outputs are views of their inputs into the graph at all; instead, it would just dump the (hardcoded) sizes/strides of views as outputs into the graph. With the changes in the aot autograd PR, we're now trusting inductor to return us an output alias with the correct metadata. Which... we kinda need to, since we talked a few weeks ago about how inductor can choose its own output memory format, so we need inductor to tell us what the strides of the outputs are in general.
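To see concretely what goes wrong when the offset is dropped, here is a minimal eager-mode illustration (not from the PR; the shapes are arbitrary). In eager, as_strided offsets are absolute into the base storage, so a view can legally reach "before" its own data:

```python
import torch

base = torch.arange(20.)
v = base[5:15]                      # view with storage_offset() == 5

out = v.as_strided((3,), (1,), 2)   # absolute offset 2: reads base[2:5]

# A clone that ignores the offset keeps only v's 10 elements, so the same
# call reads the wrong part of the (new, smaller) storage.
wrong = v.clone().as_strided((3,), (1,), 2)

print(out.tolist())    # [2.0, 3.0, 4.0]
print(wrong.tolist())  # [7.0, 8.0, 9.0]
```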

Stack from ghstack (oldest at bottom):

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire

@pytorch-bot

pytorch-bot bot commented Dec 14, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/90870

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 Failures

As of commit ee9a79b:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base f6c7cf1:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions
Contributor

This PR needs a label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

For more information, see https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

# TODO: figure out a way to get "# elements in the storage, in the current dtype of x"
needed_size = x.storage().size()
# Clone the *entire* storage that x views (including the elements before
# x's storage offset) as a flat 1-D buffer...
buffer = torch.as_strided(x, (needed_size,), (1,), 0).clone()
# ...then re-view the buffer with x's original size/stride/offset, so the
# clone aliases its storage exactly the way x aliased the original.
return torch.as_strided(buffer, x.size(), x.stride(), x.storage_offset())
Contributor Author

@ezyang this is just a test that fixed a bunch of CI failures locally for me.

The problem is that for code like this:

@torch.compile
def f(x):
    return x.as_strided(shape, stride)

base = torch.ones(20)
slice = base[5:15]  # a view with storage_offset() == 5
out = f(slice)

Inductor has some logic to clone the inputs, but it doesn't respect the storage offset of those inputs. The above example breaks because we put torch.ops.aten.as_strided(x, size, stride) into the graph, and the storage_offset is inferred from the input, which (incorrectly) comes out as 0 instead of 5.

This used to work before the current refactor, because aot autograd was not passing ops whose outputs are views of their inputs into the graph at all; instead, it would just dump the (hardcoded) sizes/strides of views as outputs into the graph. With the changes in the aot autograd PR, we're now trusting inductor to return us an output alias with the correct metadata. Which... we kinda need to, since we talked a few weeks ago about how inductor can choose its own output memory format, so we need inductor to tell us what the strides of the outputs are in general.

Contributor Author

@bdhirsh bdhirsh Dec 14, 2022

Although as I mentioned in the comment, I'm not sure this is the best approach (very much open to feedback):

(1) I'm using the deprecated storage API. When I try x._storage().size() instead, I get the wrong thing: the number of elements in the storage assuming each element is 1 byte. What I really want is the number of elements in the storage, given that the storage has the same dtype as x (see the sketch below).

(2) This will break XLA, although we can work around it. (It will also not work for sparse tensors, etc., but I'm not really sure the original impl worked well with sparse tensors either, given the as_strided calls.)
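For what it's worth, if x._storage().size() really reports a byte count (as described in (1) above), the element count in x's dtype can be recovered by dividing by the per-element size. A rough sketch of that workaround, not what the PR ends up using:

```python
# Assumes x._storage().size() reports bytes, as described in (1);
# element_size() is the number of bytes per element of x's dtype.
needed_size = x._storage().size() // x.element_size()
```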

Contributor

First, explain to me why compositional as_strided doesn't work (in this case, you only need a compositional storage offset: the storage offset you apply is relative to the preexisting offset from the input).

Contributor Author

@bdhirsh bdhirsh Dec 15, 2022

Hmm, so... it doesn't seem to me like a compositional as_strided/storage_offset is enough here:

  • If the original input tensor had storage_offset=5, that's because it was a view of some larger storage
  • If we want the output to properly preserve storage_offset=5, then the output also needs to be a view of a storage the same size as the input's
  • So we actually do need to clone enough of the input's storage (including the first 5 elements before the storage offset) before taking an equivalent view off of it, which is what the snippet is trying to do; a small illustration follows below
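To make that concrete, here is a minimal sketch (not from the PR) showing that a plain clone of a view drops both the storage offset and the storage elements before it; it uses the deprecated storage() API purely for inspection:

```python
import torch

base = torch.ones(20)
v = base[5:15]                 # shares base's 20-element storage, offset 5

c = v.clone()                  # fresh 10-element storage, offset reset to 0
print(v.storage_offset(), c.storage_offset())  # 5 0
print(v.storage().size(), c.storage().size())  # 20 10

# The 5 storage elements before v's data are gone from c, so no relative
# offset applied to c can recreate a view that reaches back into them.
```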

Comment on lines 177 to 179
# This is *bad*, x.storage() is deprecated and emits warnings
# TODO: figure out a way to get "# elements in the storage, in the current dtype of x"
needed_size = x.storage().size()
Collaborator

If you don't care about any data after the tensor in the storage, you could use compute_required_storage_length

Suggested change
- # This is *bad*, x.storage() is deprecated and emits warnings
- # TODO: figure out a way to get "# elements in the storage, in the current dtype of x"
- needed_size = x.storage().size()
+ from torch._prims_common import compute_required_storage_length
+ needed_size = compute_required_storage_length(x.size(), x.stride(), x.storage_offset())
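Putting the suggestion together with the snippet above, the whole clone helper might look something like this. A sketch only: clone_preserving_offset is a made-up name, not the function in this PR:

```python
import torch
from torch._prims_common import compute_required_storage_length

def clone_preserving_offset(x: torch.Tensor) -> torch.Tensor:
    # Storage elements needed to cover x, *including* its storage offset,
    # computed purely from metadata (no deprecated .storage() call).
    needed_size = compute_required_storage_length(
        x.size(), x.stride(), x.storage_offset()
    )
    # Clone the covered storage as a flat buffer, then re-view it with
    # x's original size/stride/offset so the clone aliases its buffer the
    # same way x aliased the original storage.
    buffer = torch.as_strided(x, (needed_size,), (1,), 0).clone()
    return torch.as_strided(buffer, x.size(), x.stride(), x.storage_offset())
```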

Contributor

Cc @kurtamohler but it should be pretty easy; storage APIs are always in terms of bytes

Contributor Author

thanks @peterbell10, that's exactly what I was looking for!

Contributor

@ezyang ezyang left a comment

test would be nice

@bdhirsh bdhirsh added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 20, 2022
@bdhirsh
Contributor Author

bdhirsh commented Dec 22, 2022

@wconstab @ngimel, I'm interested in any thoughts you have. This fixed an as_strided test involving partial views in the inductor opinfo test suite, but it looks like it's breaking an inductor + distributed test:

python test/distributed/test_dynamo_distributed.py TestDistributedMultiProc.test_hf_bert_fsdp

It now fails with a CUDA misaligned address (full paste here):

  File "/scratch/hirsheybar2/work/pytorch/torch/_inductor/triton_ops/autotune.py", line 84, in _precompile_config
    torch.cuda.synchronize(torch.cuda.current_device())
  File "/scratch/hirsheybar2/work/pytorch/torch/cuda/__init__.py", line 577, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: misaligned address

I'm wondering:

(1) Do you think that, morally, inductor should respect the storage offset when it clones inputs? It definitely needs to for this test, although user code calling .as_strided() on an existing input view is definitely a corner case that we can xfail.

(2) If that's the right fix, then I'm pretty stumped by the distributed test failure. Do you have any tips on where to look?

@wconstab
Contributor

wconstab commented Jan 3, 2023

@bdhirsh is your theory that the change to inductor to respect original offsets is correct, but leads to unaligned addresses for some views? And these unaligned addresses only upset inductor (but worked when the program ran in eager)? (And that such cases are happening in FSDP somewhere?) Any more details you can provide?
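If that theory holds, the pointer arithmetic is easy to see in isolation. A rough sketch (CPU tensors for simplicity; the same arithmetic applies to CUDA allocations, whose base pointers are generously aligned), assuming inductor's generated kernels expect something like 16-byte alignment:

```python
import torch

base = torch.ones(20)        # float32: 4 bytes per element
v = base[5:15]               # storage_offset() == 5

# v's data pointer sits 5 * 4 = 20 bytes past the base allocation, so it
# is only 4-byte aligned even when the base pointer is 16-byte aligned.
print(base.data_ptr() % 16)  # typically 0
print(v.data_ptr() % 16)     # 4
```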

ezyang added 2 commits January 9, 2023 07:24
@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Mar 10, 2023
@github-actions github-actions bot closed this Apr 9, 2023
@facebook-github-bot facebook-github-bot deleted the gh/bdhirsh/358/head branch June 8, 2023 15:45