[test] inductor should take storage_offset() into account when cloning inputs #90870
Conversation
```
# TODO: figure out a way to get "# elements in the storage, in the current dtype of x"
needed_size = x.storage().size()
buffer = torch.as_strided(x, (needed_size,), (1,), 0).clone()
return torch.as_strided(buffer, x.size(), x.stride(), x.storage_offset())
```
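For context, a self-contained version of that snippet might look like the sketch below; the helper name `clone_preserve_strides` is my own label for it, and the sketch mirrors the approach rather than the exact PR code:
```
import torch

def clone_preserve_strides(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper name. Clone the *entire* storage backing x (in x's
    # dtype), then re-view the clone with x's size/stride/storage_offset so the
    # result aliases its new buffer exactly the way x aliased the old storage.
    needed_size = x.storage().size()  # deprecated accessor, kept to mirror the snippet above
    buffer = torch.as_strided(x, (needed_size,), (1,), 0).clone()
    return torch.as_strided(buffer, x.size(), x.stride(), x.storage_offset())

base = torch.ones(20)
view = base[5:15]                      # storage_offset() == 5
cloned = clone_preserve_strides(view)
assert cloned.storage_offset() == view.storage_offset()
assert torch.equal(cloned, view)
```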
@ezyang this is just a test that fixed a bunch of CI failures locally for me.
The problem is that for code like this:
```
@torch.compile
def f(x):
    return x.as_strided(shape, stride)

base = torch.ones(20)
slice = base[5:15]
out = f(slice)
```
Inductor has some logic to clone the inputs, but it doesn't respect the storage offset of those inputs. The above example breaks because we will put torch.ops.aten.as_strided(x, size, stride) into the graph, and the storage_offset will be inferred from the input and (incorrectly) come out as 0 instead of 5.
This used to work before the current refactor, because aot autograd was not passing ops that return views of inputs into the graph at all; instead, it would just dump the (hardcoded) sizes/strides of the views as outputs into the graph. With the changes in the aot autograd PR, we're now trusting inductor to return us an output alias with the correct metadata. Which... we kind of need to, since we talked a few weeks ago about how inductor can choose its own output memory format, so we need inductor to tell us what the strides of the outputs are in general.
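To make the offset inference concrete, here is a small eager-mode sketch (the shape/stride values are illustrative, not taken from the failing test):
```
import torch

base = torch.ones(20)
view = base[5:15]
print(view.storage_offset())   # 5: the slice starts 5 elements into base's storage

# as_strided without an explicit storage_offset inherits the input's offset,
# so the result also sits at offset 5 of the same storage; a cloned input that
# resets the offset to 0 would make this come out wrong.
out = view.as_strided((2, 2), (1, 2))
print(out.storage_offset())    # 5
print(out.data_ptr() == base.data_ptr() + 5 * base.element_size())  # True
```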
Although as I mentioned in the comment, I'm not sure that this is the best approach (very much open to feedback):
(1) I'm using the deprecated storage API. When I try x._storage().size() instead, I get the wrong thing: I get "number of elements in the storage assuming each element is 1 byte". What I really want is "number of elements in the storage, given that the storage is the same dtype as x"
(2) This will break XLA, although we can work around it. (It will also not work for sparse tensors, etc., but I'm not really sure that the original impl worked well with sparse tensors either, given the as_strided calls.)
Explain to me first why compositional as_strided doesn't work (in this case, you only need a compositional storage offset: the storage offset you apply is relative to the preexisting offset from the input).
Hmm so... it doesn't seem to me like compositional as_strided/storage_offset is enough here.
- If the original input tensor had storage_offset=5, that's because it was some view of a larger storage
- If we want the output to also properly preserve storage_offset=5, then the output needs to be a view of a storage that is the same size as the input's
- So we actually do need to clone enough of the input's storage (including the first 5 elements before the storage offset) before taking an equivalent view of it (which is what the snippet is trying to do)
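For illustration, a plain clone() of a view resets the storage offset to 0, which is why cloning just the viewed elements isn't enough here (a minimal sketch):
```
import torch

base = torch.ones(20)
view = base[5:15]

plain = view.clone()
print(view.storage_offset())    # 5
print(plain.storage_offset())   # 0: the clone owns a fresh, compact storage

# To reproduce view's aliasing pattern, the clone has to cover the storage up
# to (at least) the offset, and then be re-viewed with the same metadata.
```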
torch/_inductor/compile_fx.py (outdated)
```
# This is *bad*, x.storage() is deprecated and emits warnings
# TODO: figure out a way to get "# elements in the storage, in the current dtype of x"
needed_size = x.storage().size()
```
If you don't care about any data after the tensor in the storage, you could use compute_required_storage_length
Suggested change:
```
- # This is *bad*, x.storage() is deprecated and emits warnings
- # TODO: figure out a way to get "# elements in the storage, in the current dtype of x"
- needed_size = x.storage().size()
+ from torch._prims_common import compute_required_storage_length
+ needed_size = compute_required_storage_length(x.size(), x.stride(), x.storage_offset())
```
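For reference, a quick sketch of what that helper returns for the example from this PR (assuming `compute_required_storage_length` from `torch._prims_common`, as in the suggestion above):
```
import torch
from torch._prims_common import compute_required_storage_length

base = torch.ones(20)
view = base[5:15]

# Number of elements the storage must hold to back `view`, counting the elements
# skipped by the storage offset but not any trailing data: 5 (offset) + 10 (viewed) = 15.
needed = compute_required_storage_length(view.size(), view.stride(), view.storage_offset())
print(needed)                 # 15
print(view.storage().size())  # 20: the full (deprecated) typed-storage size, for comparison
```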
Cc @kurtamohler but it should be pretty easy; storage APIs are always in terms of bytes
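For context, the untyped storage size is reported in bytes, so getting back to an element count in x's dtype takes a division by the element size; a sketch using the non-deprecated accessor available in current PyTorch:
```
import torch

x = torch.ones(20, dtype=torch.float32)

nbytes = x.untyped_storage().nbytes()   # 80: storage size in bytes
nelems = nbytes // x.element_size()     # 20: elements, in x's dtype
print(nbytes, nelems)
```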
Thanks @peterbell10, that's exactly what I was looking for!
ezyang left a comment
test would be nice
@wconstab @ngimel, I'm interested if you have any thoughts. This fixed an … Which now fails with a CUDA misaligned address (full paste here). I'm wondering: (2) If that's the right fix, then I'm pretty stumped by the distributed test failure. Do you have any tips on where to look?
@bdhirsh is your theory that the change to inductor to respect original offsets is correct, but leads to unaligned addresses for some views? And these unaligned addresses only upset inductor (but worked when the program ran in eager)? (And that such cases are happening in FSDP somewhere?) Any more details you can provide?
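For context on the alignment angle, a storage offset shifts the data pointer by offset * element_size bytes, which can land off the 16-byte boundary that vectorized kernels often assume; a small sketch (not tied to the actual FSDP/CUDA failure):
```
import torch

base = torch.ones(20, dtype=torch.float32)  # allocator-returned pointer, typically 16-byte aligned
view = base[5:15]                           # data pointer shifted by 5 * 4 = 20 bytes

print(base.data_ptr() % 16)   # usually 0
print(view.data_ptr() % 16)   # usually 4, i.e. not 16-byte aligned
```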
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as stale.
The problem is that, for code like the example above (existing test: `TestInductorOpInfoCUDA.test_comprehensive_as_strided_partial_views_cuda_float64`), Inductor has some logic to clone the inputs, but it doesn't respect the storage offset of those inputs. The example breaks because we will put torch.ops.aten.as_strided(x, size, stride) into the graph, and the storage_offset will be inferred from the input and (incorrectly) come out as 0 instead of 5. We are also forced to use …
This used to work before the current refactor, because aot autograd was not passing ops that return views of inputs into the graph at all; instead, it would just dump the (hardcoded) sizes/strides of the views as outputs into the graph. With the changes in the aot autograd PR, we're now trusting inductor to return us an output alias with the correct metadata. Which... we kind of need to, since we talked a few weeks ago about how inductor can choose its own output memory format, so we need inductor to tell us what the strides of the outputs are in general.
Stack from ghstack (oldest at bottom):
cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire