Check for memory overlap in more ATen ops #39878
💊 CI failures summary: as of commit 60e0565 (more details on the Dr. CI page), ✅ none of the CI failures appear to be your fault 💚. ❄️ 8 failures tentatively classified as flaky, but reruns have not yet been triggered to confirm.
One thing I'm not sure about is (1) is it true that we should be universally checking for memory overlap in all in-place/out TensorIterator invocations, and (2) if so, why isn't this built into the API in a more structured way, rather than as an argument to each call site that you could forget?
@peterbell10 please add the original author of the memory overlap checks and the reviewers of that PR as reviewers to this PR.
cc @pbelevich who seems to have done most of the TensorIterator work in #22917 and #24058
Most of the time, I think so. There are some cases like reductions where the tensors are intentionally given 0-strides and so these checks would give false-positives.
Perhaps we should remove the defaults from the

Removing myself as reviewer.
As a side note, in numpy broadcasted arrays are read-only:

>>> x = np.random.rand(100)
>>> y = np.broadcast_to(x, (10, 100))
>>> y[0, 0] = 1
ValueError: assignment destination is read-only
>>> y.flags['WRITEABLE']
False
I'm in favor of removing the

Something to consider is that we could just default to checking memory overlap in custom tensor iterators but selectively disable it for the cases we know won't work. It sounds like reductions are one such case?
Force-pushed from ef62811 to 6d6f832.
I've rebased on the latest master, changed the default in
Thanks!
zou3519 left a comment:
(not a full review)
torch/distributions/multinomial.py (outdated)
Why did this have to change?
This was a bug exposed by the new checks in masked_fill. It first broadcasts self.logits with value, then writes to the broadcasted (i.e. zero-strided) array in a way that overwrites entire rows instead of only where value == 0.
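To make the failure mode concrete, here is a minimal sketch (my own example, not the actual multinomial.py code) of what writing through a mask into a broadcasted, zero-strided tensor does without an overlap check:

import torch

x = torch.arange(4.)                        # shape (4,)
y = x.expand(3, 4)                          # shape (3, 4), stride (0, 1): every row aliases x
mask = torch.zeros(3, 4, dtype=torch.bool)
mask[0, 2] = True                           # the intent is to overwrite one element of row 0

# Without the check, masked_fill_ silently writes through the shared storage:
# x[2] becomes -1, so the value shows up in every row of y, not just row 0.
# With the checks added in this PR it should instead raise and suggest clone().
y.masked_fill_(mask, -1.)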
test/test_distributions.py (outdated)
Why did this have to change?
The next line writes into a broadcasted array:
expected[expected == float('inf')] = 0.
This seems BC-breaking. Even if expected is a broadcasted array, expected[expected == float('inf')] = 0. produces a correct result because it is setting the right elements to 0
Note that this is a masked_fill_, which is exactly the operation in the original issue. To preserve the correct uses, like expected[expected == float('inf')] = 0., we could inspect the mask along the broadcasted dimensions and only allow the fill if the mask is constant along those dimensions. Or maybe we should deprecate that behavior as well.
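As a rough illustration of that idea, a Python sketch of my own (the helper name is made up, and a real check would live in ATen): the fill is well defined whenever the mask is constant along every dimension in which the target is broadcast, because then all aliased elements receive the same decision.

import torch

def mask_constant_along_broadcast_dims(target, mask):
    # Hypothetical helper: for every dimension where `target` is broadcast
    # (stride 0 with size > 1), check that `mask` takes a single value along
    # that dimension. If so, every aliased element is written identically.
    for dim, (size, stride) in enumerate(zip(target.shape, target.stride())):
        if stride == 0 and size > 1:
            first = mask.select(dim, 0).unsqueeze(dim).expand_as(mask)
            if not torch.equal(mask, first):
                return False
    return True

For expected[expected == float('inf')] = 0. the mask is computed from the broadcasted tensor itself, so it repeats along the broadcast dimension and the helper returns True; for a mask that varies along a broadcast dimension (as in the multinomial bug) it returns False.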
Hmm, there are some operations where checking for memory overlap isn't necessary and would cause backwards incompatibility. For example, checking memory overlap for something like

However, we have cases like

EDIT: another way to solve the problem is to follow numpy and make broadcasted arrays read-only. That's a pretty big change though and I am not sure what it would entail.
For

Perhaps fill should just raise a deprecation warning instead of an error?
@peterbell10 sorry, this fell off my radar -- I'll have more time to review this starting next week and will give this a read-through then.
cc @peterbell10 I thought about the problem we're addressing here and wrote up some comments. Please let me know what your thoughts are. Also cc @mruberry (for numpy compatibility) and @VitalyFedyunin @ngimel @ezyang (for TensorIterator). I'm curious if you folks have thoughts on the internal memory overlap problem and potential solutions.

What is the issue we're trying to solve?

As a common pitfall, users will expand tensors and then perform in-place operations on them. We've mitigated this problem in the past by adding internal memory overlap checks to some of the more common operations (e.g., we check if a Tensor has stride 0 and size > 1). However, this problem still exists for a lot of other operators.

My concerns on this Pull Request

Potential for BC-breaking behavior. There are some operators where, if an output has internal memory overlap, or if an output and input have partial memory overlap, the result is always correct (e.g., where the output tensor has internal overlap, but it is OK that it does), and making a change like this would be BC-breaking (in fact, there is a test in this PR that we had to change because of this). Maybe these cases don't actually matter, but I think if we go down the route of this PR, we should take a detailed survey of the operations involved that we would need to deprecate and have a deprecation plan.

Decision to make TensorIterator check for memory overlap by default. This isn't a complete solution to our problem because not all PyTorch operations use TensorIterator or will use TensorIterator in the future. Secondly, we've been doing a lot of TH->ATen migrations. If we change TensorIterator to check for memory overlap by default, then TH->ATen migrations may unwillingly introduce BC-breaking changes.

Alternatives

I think if we have to go through the trouble of identifying all the BC-breaking operators and deprecating the "expand + modify-in-place" behavior for those operators, we should take some time to explore the problem space. Assuming we have to break BC, it'll be 6-9 months (2-3 releases) to finish deprecating and updating everything. One alternative that seems nice to me is to introduce a WRITEABLE flag to tensors, a la numpy. Broadcasted tensors (obtained from e.g.
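Purely for illustration, a sketch of what numpy-style semantics might look like from the user's side (the flag, accessor, and error below are hypothetical, not an existing PyTorch API):

import torch

x = torch.zeros(5)
y = x.expand(3, 5)        # today: a writable view with stride 0 along dim 0

# Hypothetical contract if expanded views carried a numpy-style WRITEABLE flag:
#   y.is_writeable()      -> False   (made-up accessor)
#   y.add_(1)             -> error: the view is read-only, clone() it first
#   y.clone().add_(1)     -> fine: clone() materializes its own storage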
To go down this route I think there would need to be

A huge benefit of having
I'm not exactly sure what to do here. Let me describe some high level things to think about.
Hope that helps.
Actual bug in distributions being caught by the new checks here.
Force-pushed from 98a4d04 to 60e0565.
I've rebased and added an exception for

With numpy's flags system the user can set the writable flag manually, then write to the array as normal. Though a more idiomatic way would be to use
Now that #41923 is closed I am returning to review this. So from the tests it looks like we've added memory overlap support for the following:

Did we enable memory overlap checks for any other operators? I feel like switching the default on TensorIterator to check memory overlap should have affected more operators.
Most
@peterbell10, sorry for the delay in review. I had a conversation with @mruberry offline about this PR. We came to the following ideas:

From those points, I'll review the pull request from the following perspective:
It'd also be nice to document when functions should and shouldn't be setting the check flag. |
| ("addcdiv", True, True, 'cuda'), | ||
| ("lerp", True, True, 'cpu'), | ||
| ("lerp", False, False, 'cuda') | ||
| ("lerp", True, True, 'cuda') |
I did an audit of TensorIterator::unary_op, TensorIterator::binary_op, and TensorIterator::nullary_op cases in the code that use the default. The default before this PR was check_mem_overlap=False, but after this PR those are True. I found a few that we didn't test, so it would be nice to add tests for them (to check that we've added the checks for both CPU and CUDA):

For TensorIterator::unary_op:

- elu: test the in-place variant (pytorch/aten/src/ATen/native/Activation.cpp, lines 111 to 117 in 2b7108a):
  Tensor & elu_(Tensor & self, Scalar alpha, Scalar scale, Scalar input_scale) { return at::elu_out(self, self, alpha, scale, input_scale); }

- hardswish: test the in-place variant (pytorch/aten/src/ATen/native/Activation.cpp, lines 150 to 154 in 2b7108a):
  Tensor& hardswish_out(Tensor& result, const Tensor& self) { auto iter = TensorIterator::unary_op(result, self); hardswish_stub(iter.device_type(), iter); return result; }

- leaky_relu: test the in-place variant (pytorch/aten/src/ATen/native/Activation.cpp, lines 713 to 735 in 2b7108a):
  Tensor& leaky_relu_out(Tensor& result, const Tensor& self, Scalar negval) { auto iter = TensorIterator::unary_op(result, self); leaky_relu_stub(iter.device_type(), iter, negval); return result; }
  Tensor leaky_relu(const Tensor& self, Scalar negval) { Tensor result; auto iter = TensorIterator::unary_op(result, self); leaky_relu_stub(iter.device_type(), iter, negval); return iter.output(); }
  Tensor & leaky_relu_(Tensor & self, Scalar neg_val) { return at::leaky_relu_out(self, self, neg_val); }

For TensorIterator::binary_op:

- index_add_: has two paths.
  auto iter = TensorIterator::binary_op(selfSlice, selfSlice, sourceSlice);

- atan2: test the out variant and in-place variant (pytorch/aten/src/ATen/native/BinaryOps.cpp, lines 337 to 350 in 2b7108a):
  Tensor& atan2_out(Tensor& result, const Tensor& self, const Tensor& other) { auto iter = TensorIterator::binary_op(result, self, other); atan2_stub(iter.device_type(), iter); return result; }
  Tensor atan2(const Tensor& self, const Tensor& other) { Tensor result = at::empty({0}, self.options()); return native::atan2_out(result, self, other); }
  Tensor& atan2_(Tensor& self, const Tensor& other) { return native::atan2_out(self, self, other); }

- __ilshift__ (pytorch/aten/src/ATen/native/BinaryOps.cpp, lines 596 to 607 in 2b7108a):
  Tensor& __ilshift__(Tensor& self, const Tensor& other) { auto iter = TensorIterator::binary_op(self, self, other); lshift_stub(iter.device_type(), iter); return self; }
  Tensor& __ilshift__(Tensor& self, Scalar other) { auto wrapper = wrapped_scalar_tensor(other).toType(self.scalar_type()); auto iter = TensorIterator::binary_op(self, self, wrapper); lshift_stub(iter.device_type(), iter); return self; }

- __irshift__ (pytorch/aten/src/ATen/native/BinaryOps.cpp, lines 624 to 635 in 2b7108a):
  Tensor& __irshift__(Tensor& self, const Tensor& other) { auto iter = TensorIterator::binary_op(self, self, other); rshift_stub(iter.device_type(), iter); return self; }
  Tensor& __irshift__(Tensor& self, Scalar other) { auto wrapper = wrapped_scalar_tensor(other).toType(self.scalar_type()); auto iter = TensorIterator::binary_op(self, self, wrapper); rshift_stub(iter.device_type(), iter); return self; }

- hypot: test the out variant and in-place variant (pytorch/aten/src/ATen/native/BinaryOps.cpp, lines 888 to 903 in 2b7108a):
  Tensor& hypot_out(Tensor& result, const Tensor& self, const Tensor& other) { auto iter = TensorIterator::binary_op(result, self, other); hypot_stub(iter.device_type(), iter); return result; }
  Tensor hypot(const Tensor& self, const Tensor& other) { Tensor result; auto iter = TensorIterator::binary_op(result, self, other); hypot_stub(iter.device_type(), iter); return iter.output(); }
  Tensor& hypot_(Tensor& self, const Tensor& other) { return at::hypot_out(self, self, other); }

- nextafter: test the out variant and in-place variant (pytorch/aten/src/ATen/native/BinaryOps.cpp, lines 905 to 920 in 2b7108a):
  Tensor& nextafter_out(Tensor& result, const Tensor& self, const Tensor& other) { auto iter = TensorIterator::binary_op(result, self, other); nextafter_stub(iter.device_type(), iter); return result; }
  Tensor nextafter(const Tensor& self, const Tensor& other) { Tensor result; auto iter = TensorIterator::binary_op(result, self, other); nextafter_stub(iter.device_type(), iter); return iter.output(); }
  Tensor& nextafter_(Tensor& self, const Tensor& other) { return at::nextafter_out(self, self, other); }

- threshold: test the in-place variant (pytorch/aten/src/ATen/native/Activation.cpp, lines 380 to 392 in 2b7108a):
  Tensor threshold(const Tensor& self, Scalar threshold, Scalar value) { return threshold_out(nullopt, self, threshold, value, self); }
  Tensor& threshold_(Tensor& self, Scalar threshold, Scalar value) { threshold_out(make_optional(self), self, threshold, value, self); return self; }
  Tensor& threshold_out(Tensor& result, const Tensor& self, Scalar threshold, Scalar value) { threshold_out(make_optional(result), self, threshold, value, self); return result; }
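A sketch of roughly the kind of test being asked for above (my own; it assumes the new checks raise RuntimeError when the written-to tensor has internal overlap, as the pre-existing checks do):

import torch
import torch.nn.functional as F

def expect_overlap_error(fn):
    # Helper for the sketch: the op should refuse to write into an
    # internally-overlapping tensor.
    try:
        fn()
    except RuntimeError:
        return
    raise AssertionError("expected a memory-overlap RuntimeError")

devices = ['cpu'] + (['cuda'] if torch.cuda.is_available() else [])
for device in devices:
    overlapping = torch.rand(1, 3, device=device).expand(4, 3)  # stride 0 along dim 0
    other = torch.rand(4, 3, device=device)

    # unary_op: in-place variant writing into an internally-overlapping tensor
    expect_overlap_error(lambda: F.elu(overlapping, inplace=True))

    # binary_op: out variant where the out tensor overlaps internally
    out = torch.rand(1, 3, device=device).expand(4, 3)
    expect_overlap_error(lambda: torch.atan2(other, other, out=out))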
In retrospect, because there are so many ops, I think this PR would be easier to review if we split it up into, e.g.:
- a PR to remove TensorIterator::unary_op's check_mem_overlap (and instead default to True)
- a PR to remove TensorIterator::binary_op's check_mem_overlap (and instead default to True)
- ...
- a PR that hits the switch for the default of check_mem_overlap in TensorIterator to True.
However, I'm fine with reviewing all the changes in one single PR (this PR) because the ultimate line count will probably still be manageable.
I will try to split this up into a clean stack of PRs. The longer each individual change takes to land, the more ops will be ported to TensorIterator with the old default and will need changing. So anything to make review go smoother should be worth it IMO.
c10::optional<DimVector> static_shape_ = c10::nullopt;
c10::optional<std::pair<ScalarType, Device>> static_dtype_and_device_ = c10::nullopt;
bool check_mem_overlap_ = false;  // before this PR
bool check_mem_overlap_ = true;   // after this PR
We should add some prescriptive documentation to set_check_mem_overlap (pytorch/aten/src/ATen/native/TensorIterator.h, lines 429 to 432 in 2b7108a) saying when functions should use set_check_mem_overlap = False, and that the default is True:

TensorIteratorConfig& set_check_mem_overlap(bool check_mem_overlap) {
  check_mem_overlap_ = check_mem_overlap;
  return *this;
}
zou3519 left a comment:
(not a full review, I'm still going through the ops)
Closing as this was superseded by gh-43423.
Fixes #39639 (and a lot more)
I went through every use of TensorIterator and added the check_mem_overlap flag to all the in-place ops, or ops using out tensors. I only touched TH/THC where the operation was implemented by TensorIterator on the other device, for consistency. Hopefully the remaining ops can be added as they are ported to ATen.