Conversation

@bdhirsh
Contributor

@bdhirsh bdhirsh commented Dec 14, 2022

The problem is that for code like this
(existing test: TestInductorOpInfoCUDA.test_comprehensive_as_strided_partial_views_cuda_float64)

@torch.compile
def f(x):
    return x.as_strided(shape, stride)

base = torch.ones(20)
slice = base[5:15]  # a view with storage_offset() == 5
out = f(slice)

Inductor has some logic to clone the inputs, but it doesn't respect the storage offset of those inputs. The above example breaks because we put torch.ops.aten.as_strided(x, size, stride) into the graph, and the storage_offset is inferred from the input, which (incorrectly) comes out as 0 instead of 5. We are also forced to use

This used to work before the current refactor, because aot autograd was not passing ops whose outputs are views of their inputs into the graph at all; instead, it would just dump the (hardcoded) sizes/strides of views as outputs into the graph. With the changes in the aot autograd PR, we're now trusting inductor to return us an output alias with the correct metadata. Which... we kinda need to, since we talked a few weeks ago about how inductor can choose its own output memory format, so we need inductor to tell us what the strides of the outputs are in general.
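To see concretely what goes wrong when the offset is dropped, here is a minimal eager-mode illustration (not from the PR; the shapes are arbitrary). In eager, as_strided offsets are absolute into the base storage, so a view can legally reach "before" its own data:

```python
import torch

base = torch.arange(20.)
v = base[5:15]                      # view with storage_offset() == 5

out = v.as_strided((3,), (1,), 2)   # absolute offset 2: reads base[2:5]

# A clone that ignores the offset keeps only v's 10 elements, so the same
# call reads the wrong part of the (new, smaller) storage.
wrong = v.clone().as_strided((3,), (1,), 2)

print(out.tolist())    # [2.0, 3.0, 4.0]
print(wrong.tolist())  # [7.0, 8.0, 9.0]
```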

Stack from ghstack (oldest at bottom):

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire

@pytorch-bot

pytorch-bot bot commented Dec 14, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/90870

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 Failures

As of commit ee9a79b:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base f6c7cf1:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions
Contributor

This PR needs a label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

For more information, see https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

# TODO: figure out a way to get "# elements in the storage, in the current dtype of x"
needed_size = x.storage().size()
# Clone the *entire* storage that x views (including the elements before
# x's storage offset) as a flat 1-D buffer...
buffer = torch.as_strided(x, (needed_size,), (1,), 0).clone()
# ...then re-view the buffer with x's original size/stride/offset, so the
# clone aliases its storage exactly the way x aliased the original.
return torch.as_strided(buffer, x.size(), x.stride(), x.storage_offset())
Contributor Author

@ezyang this is just a test that fixed a bunch of CI failures locally for me.

The problem is that for code like this:

@torch.compile
def f(x):
    return x.as_strided(shape, stride)

base = torch.ones(20)
slice = base[5:15]  # a view with storage_offset() == 5
out = f(slice)

Inductor has some logic to clone the inputs, but it doesn't respect the storage offset of those inputs. The above example breaks because we put torch.ops.aten.as_strided(x, size, stride) into the graph, and the storage_offset is inferred from the input, which (incorrectly) comes out as 0 instead of 5.

This used to work before the current refactor, because aot autograd was not passing ops whose outputs are views of their inputs into the graph at all; instead, it would just dump the (hardcoded) sizes/strides of views as outputs into the graph. With the changes in the aot autograd PR, we're now trusting inductor to return us an output alias with the correct metadata. Which... we kinda need to, since we talked a few weeks ago about how inductor can choose its own output memory format, so we need inductor to tell us what the strides of the outputs are in general.

Contributor Author

@bdhirsh bdhirsh Dec 14, 2022

Although as I mentioned in the comment, I'm not sure this is the best approach (very much open to feedback):

(1) I'm using the deprecated storage API. When I try x._storage().size() instead, I get the wrong thing: the number of elements in the storage assuming each element is 1 byte. What I really want is the number of elements in the storage, given that the storage has the same dtype as x (see the sketch below).

(2) This will break XLA, although we can work around it. (It will also not work for sparse tensors, etc., but I'm not really sure the original impl worked well with sparse tensors either, given the as_strided calls.)
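For what it's worth, if x._storage().size() really reports a byte count (as described in (1) above), the element count in x's dtype can be recovered by dividing by the per-element size. A rough sketch of that workaround, not what the PR ends up using:

```python
# Assumes x._storage().size() reports bytes, as described in (1);
# element_size() is the number of bytes per element of x's dtype.
needed_size = x._storage().size() // x.element_size()
```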

Contributor

First, explain to me why compositional as_strided doesn't work (in this case, you only need a compositional storage offset: the storage offset you apply is relative to the preexisting offset from the input).

Contributor Author

@bdhirsh bdhirsh Dec 15, 2022

Hmm, so... it doesn't seem to me like a compositional as_strided/storage_offset is enough here:

  • If the original input tensor had storage_offset=5, that's because it was a view of some larger storage
  • If we want the output to properly preserve storage_offset=5, then the output also needs to be a view of a storage the same size as the input's
  • So we actually do need to clone enough of the input's storage (including the first 5 elements before the storage offset) before taking an equivalent view off of it, which is what the snippet is trying to do; a small illustration follows below
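To make that concrete, here is a minimal sketch (not from the PR) showing that a plain clone of a view drops both the storage offset and the storage elements before it; it uses the deprecated storage() API purely for inspection:

```python
import torch

base = torch.ones(20)
v = base[5:15]                 # shares base's 20-element storage, offset 5

c = v.clone()                  # fresh 10-element storage, offset reset to 0
print(v.storage_offset(), c.storage_offset())  # 5 0
print(v.storage().size(), c.storage().size())  # 20 10

# The 5 storage elements before v's data are gone from c, so no relative
# offset applied to c can recreate a view that reaches back into them.
```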

Comment on lines 177 to 179
# This is *bad*, x.storage() is deprecated and emits warnings
# TODO: figure out a way to get "# elements in the storage, in the current dtype of x"
needed_size = x.storage().size()
Collaborator

If you don't care about any data after the tensor in the storage, you could use compute_required_storage_length

Suggested change
- # This is *bad*, x.storage() is deprecated and emits warnings
- # TODO: figure out a way to get "# elements in the storage, in the current dtype of x"
- needed_size = x.storage().size()
+ from torch._prims_common import compute_required_storage_length
+ needed_size = compute_required_storage_length(x.size(), x.stride(), x.storage_offset())
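Putting the suggestion together with the snippet above, the whole clone helper might look something like this. A sketch only: clone_preserving_offset is a made-up name, not the function in this PR:

```python
import torch
from torch._prims_common import compute_required_storage_length

def clone_preserving_offset(x: torch.Tensor) -> torch.Tensor:
    # Storage elements needed to cover x, *including* its storage offset,
    # computed purely from metadata (no deprecated .storage() call).
    needed_size = compute_required_storage_length(
        x.size(), x.stride(), x.storage_offset()
    )
    # Clone the covered storage as a flat buffer, then re-view it with
    # x's original size/stride/offset so the clone aliases its buffer the
    # same way x aliased the original storage.
    buffer = torch.as_strided(x, (needed_size,), (1,), 0).clone()
    return torch.as_strided(buffer, x.size(), x.stride(), x.storage_offset())
```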

Contributor

Cc @kurtamohler but it should be pretty easy; storage APIs are always in terms of bytes

Contributor Author

thanks @peterbell10, that's exactly what I was looking for!

Contributor

@ezyang ezyang left a comment

test would be nice

@bdhirsh bdhirsh added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 20, 2022
@bdhirsh
Contributor Author

bdhirsh commented Dec 22, 2022

@wconstab @ngimel, I'm interested in any thoughts you have. This fixed an as_strided test involving partial views in the inductor opinfo test suite, but it looks like it's breaking an inductor + distributed test:

python test/distributed/test_dynamo_distributed.py TestDistributedMultiProc.test_hf_bert_fsdp

It now fails with a CUDA misaligned address (full paste here):

  File "/scratch/hirsheybar2/work/pytorch/torch/_inductor/triton_ops/autotune.py", line 84, in _precompile_config
    torch.cuda.synchronize(torch.cuda.current_device())
  File "/scratch/hirsheybar2/work/pytorch/torch/cuda/__init__.py", line 577, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: misaligned address

I'm wondering:

(1) Do you think that, morally, inductor should respect the storage offset when it clones inputs? It definitely needs to for this test, although user code calling .as_strided() on an existing input view is definitely a corner case that we can xfail.

(2) If that's the right fix, then I'm pretty stumped by the distributed test failure. Do you have any tips on where to look?

@wconstab
Contributor

wconstab commented Jan 3, 2023

@bdhirsh is your theory that the change to inductor to respect original offsets is correct, but leads to unaligned addresses for some views? And these unaligned addresses only upset inductor (but worked when the program ran in eager)? (And that such cases are happening in FSDP somewhere?) Any more details you can provide?
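If that theory holds, the pointer arithmetic is easy to see in isolation. A rough sketch (CPU tensors for simplicity; the same arithmetic applies to CUDA allocations, whose base pointers are generously aligned), assuming inductor's generated kernels expect something like 16-byte alignment:

```python
import torch

base = torch.ones(20)        # float32: 4 bytes per element
v = base[5:15]               # storage_offset() == 5

# v's data pointer sits 5 * 4 = 20 bytes past the base allocation, so it
# is only 4-byte aligned even when the base pointer is 16-byte aligned.
print(base.data_ptr() % 16)  # typically 0
print(v.data_ptr() % 16)     # 4
```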

ezyang added 2 commits January 9, 2023 07:24
@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Mar 10, 2023
@github-actions github-actions bot closed this Apr 9, 2023
@facebook-github-bot facebook-github-bot deleted the gh/bdhirsh/358/head branch June 8, 2023 15:45