Invert ownership between PyFunction and THPFunction. #22983

ezyang · 2019-07-17T18:42:16Z

Stack from ghstack:

Invert ownership between PyFunction and THPFunction. #22983 Invert ownership between PyFunction and THPFunction.

Whence this fix? Let's introduce the dramatis personae:

THPFunction - a PyObject that represents a ctx object that is passed to our custom Python autograd functions. It contains a PyFunction
PyFunction - a subclass of Function which says how to compute the actual backward function. Note that this is a closure in the sense that it is associated with a ctx object which actually contains the saved tensors. It contains a non-owning pointer to THPFunction

OK, do you see something wrong with this situation? The ownership here is totally backwards! It is PyFunction that should own the THPFunction, because a PyFunction is a closure that should own its context. To put it in other words, it is OK if a PyFunction gets freed before THPFunction, but not vice versa.

This sounds nice, but making this happen in practice is a bit trickier, because sometimes we use THPFunction to represent the PyFunction. In particular, we represent a grad_fn object in the Python binding level as the THPFunction (I can only imagine we did this because we wanted the object identity of grad_fn to line up with the class the user provided). I did not have time to untangle this situation, so for this patch, in the cases where we're improperly treating a THPFunction as a PyFunction, we simply assume that the PyFunction is still live (it usually is), and yell at the user if it's not.

I constructed this patch in a few steps:

First, I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from
THPFunction to PyFunction so that I could have an escape hatch when the existing code improperly treated THPFunction as an owning reference to PyFunction. In some cases your code might break because of this, but most of the time the PyFunction is live and everything works out (intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live).
There are some cases where PyFunction is provably live at some point in time. In that case, don't bother extracting a strong reference from the weak pointer, just pass the strong pointer along directly!
For all the other cases where we can't prove it, write a long error message to help the user out when they run into that situation.

Some subtleties:

PyFunction is a C++ object that refers to a PyObject. This means it needs
a custom deleter to handle deleting the PyObject, since you can't assume
you have the GIL when it dies.
The old test_gc_in_destructor failed our internal assert because we never
actually ran a backwards, and thus never actually materialized PyFunction.
I'm chalking this up as "misuse of API" and rewrote the test to not have
this problem.
Generalizing the issue above: this is BC-breaking for legacy autograd Functions
in the following way: previously, the PyFunction for a THPFunction was generated
immediately on construction of the Function f (which would eventually serve as
the grad_fn of an object when you finally call f(x).) This means that
operations like f.register_hook() which previously worked prior to f(x) no longer
work; you have to wait until you actually call the function to do registration.

I also removed dead Function::get_shared_ptr and Decref struct; these were previously needed for the crazy split refcounting scheme we had previously, but now are dead code.

Signed-off-by: Edward Z. Yang ezyang@fb.com

Differential Revision: D16422209

Fixes #16532 and #14960. This patch is a massive hack. The way I constructed it was I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from THPFunction to PyFunction so all existing code works. Essentially, this patch assumes that PyFunction stays live as long as you have a THPFunction: intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live. But as you can see from the presently skipped tests (specifically, test_hook_none), this is not always true. But it seems to be true for the code we care about, and that's enough for me! Some subtleties: - PyFunction is a C++ object that refers to a PyObject. This means it needs a custom deleter to handle deleting the PyObject, since you can't assume you have the GIL when it dies. - The old test_gc_in_destructor failed our internal assert because we never actually ran a backwards, and thus never actually materialized PyFunction. I'm chalking this up as "misuse of API" and rewrote the test to not have this problem. Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ezyang · 2019-07-17T18:45:36Z

Subsumes #16888

ezyang · 2019-07-17T18:48:18Z

There's some subsequent cleanup after this inversion that is possible, but I want to make sure this works first.

colesbury · 2019-07-17T19:47:34Z

torch/csrc/autograd/python_function.cpp

 {
-  for (const auto& hook : self->cdata.pre_hooks()) {
+  auto cdata = self->cdata.lock();
+  TORCH_INTERNAL_ASSERT(cdata);  // could do this optionally


How is this guaranteed?

It's not :) Here's what I have in my local copy:

// Traverse and clear are required for supporting Python's GC cycle handling. static int THPFunction_traverse(THPFunction *self, visitproc visit, void *arg) { - for (const auto& hook : self->cdata.pre_hooks()) { - if (auto pyhook = dynamic_cast<PyFunctionPreHook*>(hook.get())) { - Py_VISIT(pyhook->dict); + auto cdata = self->cdata.lock(); + // cdata could be null if someone constructed a legacy function but haven't + // actually called backward() on it yet. + if (cdata) { + for (const auto& hook : cdata->pre_hooks()) { + if (auto pyhook = dynamic_cast<PyFunctionPreHook*>(hook.get())) { + Py_VISIT(pyhook->dict); + } } - } - for (const auto& hook : self->cdata.post_hooks()) { - if (auto pyhook = dynamic_cast<PyFunctionPostHook*>(hook.get())) { - Py_VISIT(pyhook->dict); + for (const auto& hook : cdata->post_hooks()) { + if (auto pyhook = dynamic_cast<PyFunctionPostHook*>(hook.get())) { + Py_VISIT(pyhook->dict); + } } }

Waiting for CI to finish on the current iteration before I push though

Are you going to remove the other TORCH_INTERNAL_ASSERT(cdata) calls? I don't think any of them are safe.

mrshenli · 2019-07-17T20:20:08Z

torch/csrc/autograd/python_function.cpp

-  self->cdata.set_next_edges(std::move(input_info.next_edges));
+  Py_INCREF(self);
+  auto cdata = std::shared_ptr<PyFunction>(new PyFunction(THPObjectPtr((PyObject*)self)), deleteFunction);
+  self->cdata = cdata;


Is it possible that this method is called multiple times on the same THPFunction, so that in some invocations self->cdata is already non-NULL?

Yeah, in the legacy case it might be non-NULL. I actually have no idea what the "right" thing to do here is, because legacy autograd function behavior is nuts (all the better for @yf225 to delete them). Check this out:

macbook-pro-116:~ ezyang$ cat abb.py import torch from torch.autograd import Function class Id(Function): def forward(self, x): self.save_for_backward(x) return x def backward(self, grad_x): x = self.saved_tensors print(x) return x f = Id() x1 = torch.zeros(1, requires_grad=True) x2 = torch.ones(1, requires_grad=True) print(f) y = f(x1) print(y.grad_fn) z = f(x2) print(z.grad_fn) y.backward() macbook-pro-116:~ ezyang$ python abb.py <__main__.Id object at 0x116d58850> <__main__.Id object at 0x116d58850> <__main__.Id object at 0x116d58850> (tensor([1.], requires_grad=True),)

The second call to f clobbers the saved tensors of the first call (in y), because f is literally the same grad_fn in both cases, even though this is totally wrong. I think the API is misused here, for a legacy Function, you should only call __apply__ on it once.

On reflection, maybe the right thing is to not allocate a new PyFunction if there already is one. I can make this adjustment, but really this program is stupid and we shouldn't support it.

I've modified legacy to warn when this happens.

Stack from [ghstack](https://github.com/ezyang/ghstack): * **#22983 Invert ownership between PyFunction and THPFunction.** Fixes #16532 and #14960. This patch is a massive hack. The way I constructed it was I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from THPFunction to PyFunction so all existing code works. Essentially, this patch assumes that PyFunction stays live as long as you have a THPFunction: intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live. But as you can see from the presently skipped tests (specifically, test_hook_none), this is not always true. But it seems to be true for the code we care about, and that's enough for me! Some subtleties: - PyFunction is a C++ object that refers to a PyObject. This means it needs a custom deleter to handle deleting the PyObject, since you can't assume you have the GIL when it dies. - The old test_gc_in_destructor failed our internal assert because we never actually ran a backwards, and thus never actually materialized PyFunction. I'm chalking this up as "misuse of API" and rewrote the test to not have this problem. - Generalizing the issue above: this is BC-breaking for legacy autograd Functions in the following way: previously, the PyFunction for a THPFunction was generated immediately on construction of the Function `f` (which would eventually serve as the `grad_fn` of an object when you finally call `f(x)`.) This means that operations like `f.register_hook()` which previously worked prior to `f(x)` no longer work; you have to wait until you actually call the function to do registration. Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Fixes #16532 and #14960. This patch is a massive hack. The way I constructed it was I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from THPFunction to PyFunction so all existing code works. Essentially, this patch assumes that PyFunction stays live as long as you have a THPFunction: intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live. But as you can see from the presently skipped tests (specifically, test_hook_none), this is not always true. But it seems to be true for the code we care about, and that's enough for me! Some subtleties: - PyFunction is a C++ object that refers to a PyObject. This means it needs a custom deleter to handle deleting the PyObject, since you can't assume you have the GIL when it dies. - The old test_gc_in_destructor failed our internal assert because we never actually ran a backwards, and thus never actually materialized PyFunction. I'm chalking this up as "misuse of API" and rewrote the test to not have this problem. Signed-off-by: Edward Z. Yang <ezyang@fb.com> ghstack-source-id: b00842b Pull Request resolved: #22983

mrshenli · 2019-07-18T20:26:55Z

@ezyang I don't see any additional concerns with this PR (except the one @colesbury mentioned above), but I am not confident to give it a stamp either. Shall we bring more people to take a look?

yf225 · 2019-07-18T21:43:15Z

test/test_autograd.py

        self.assertEqual(counter[0], 1, 'bw_hook not called')
        self.assertEqual(x.grad.data, torch.ones(5, 5) * 2)

+    @unittest.skip("Currently broken, will be fixed in https://github.com/pytorch/pytorch/pull/22925")


#22925 is merged (yay!), feel free to rebase :)

yf225 · 2019-07-18T21:53:41Z

test/test_autograd.py

+        with warnings.catch_warnings(record=True) as w:
+            z = f(x2)
+        self.assertIn('extending-torch-autograd', str(w[0].message))
+        y.backward()


Should we actually call z.backward() here, since we are actually checking whether x2.grad is populated correctly?

Also, should we add an assert showing that the clobbering happens to x1.grad when y.backward() is called after f is called twice?

y.backward() results in the clobber behavior. I'm not really interested in the functional correctness of clobbering (the behavior is nuts); I'm just mostly interested that we don't segfault in this situation.

apaszke

This generally looks good and I think it makes sense (NB it would be good to link to the explanation you wrote in another PR, because the description here is not very self-contained).

I am a bit worried about what would happen if someone was to hold on to the Python instance for too long (i.e. longer than the lifetime of the PyFunction from its use), and then try to access the attributes or reuse this. I think that would trigger your internal asserts, meaning they're not really that internal.

apaszke · 2019-07-19T00:24:12Z

torch/csrc/autograd/python_function.h

+    // The actual PyFunction (in the autograd graph) that this data was
+    // saved for.  This field may be NULL (because a user can construct
+    // a THPFunction directly from Python), but when this field is non-NULL,
+    // it is guaranteed that cdata.lock()->obj == this


Why can't we enforce that Python Function objects always have their corresponding PyFunctions by allocating one whenever you instantiate an object? That way we could avoid having a weak pointer, and could hold PyFunction* will the invariant that it would always be non-null.

You also seem to be using the assumption that .lock() succeeds quite liberally in this patch, so this comment is highly confusing for me.

Why can't we enforce that Python Function objects always have their corresponding PyFunctions by allocating one whenever you instantiate an object?

There are a few layers to this question, so I want to address them separately.

In the non-legacy case, why can't we allocate a PyFunction in THPFunction_new so that THPFunction is always allocated with an associated PyFunction? The proximal cause is dumb: it's because THPFunction only holds a weak pointer to PyFunction, so if you allocate a PyFunction inside THPFunction_new, the only pointer that would be keeping it live is a weak pointer, and it would immediately go dead once THPFunction_new returns.

"But wait, Edward!" you might say, "Doesn't THPFunction_apply have this problem too?" No! We keep an owning reference to PyFunction for the entire body of this function, and eventually we will store an OWNING reference to PyFunction in the grad_fn of the Variables we return (in process_outputs); that owning reference will keep PyFunction live when we return. But there's no way to setup this owning reference in the constructor THPFunction_new, so we have to set it up somewhere else. It's a bit finely balanced, but in the non-legacy case, as long as you don't go constructing copies of THPFunction outside of apply (it is possible to do this, but not in the public API), we maintain the invariant that every THPFunction has a corresponding PyFunction.

In the legacy case, why can't we guarantee that there is an associated PyFunction? The first answer is, the legacy case is dumb and we shouldn't support it. But digging deeper, I think the root problem is that THPFunction (the ctx object) is what we expose to the world as grad_fn, when actually we should have exposed some sort of Python binding of PyFunction. So there is a confusion: we treat THPFunction as a Python binding as PyFunction, but we ALSO treat it as the ctx object we pass in when doing autograd. These two use cases are incompatible, but right now they're smooshed together in a single object.

The suggestion you've made here is treating THPFunction as just a Python binding for PyFunction. Which is reasonable, but a much more involved change.

apaszke · 2019-07-19T00:26:28Z

torch/csrc/autograd/python_function.cpp


-  Py_INCREF((PyObject*)self);
-  return std::shared_ptr<PyFunction>(&self->cdata, Decref());
+  return self->cdata.lock();


How come we no longer need to do the incref/decref here? IIRC debug builds of Python report refcounting errors if we don't keep the C++/Python refcounts approximately in sync.

I don't know. I don't understand why it was previously needed. I guess I can build a debug build of Python and see if I still get refcounting errors. If I don't, would that be sufficient?

I have a debug build of PyTorch with this patch and it has not reported any refcounting errors.

Not a debug build of PyTorch, a debug build of CPython. I once made it such that we always use a single shared_ptr to manage PyFunction, which has started triggering Python's GC asserts and the refcounting scheme has been updated by @colesbury accordingly. We should check that this patch does not suffer from the same issues as mine.

Also, there used to be this warning that you're not supposed to take weak pointers of PyFunctions because they can be managed by multiple shared_ptr blocks. That's no longer the case after this patch, right?

apaszke · 2019-07-19T00:28:34Z

torch/csrc/autograd/python_function.cpp

+  auto cdata = self->cdata.lock();
+  // cdata could be null if someone constructed a legacy function but haven't
+  // actually called backward() on it yet.
+  if (cdata) {


nit: put the assignment into the if condition

apaszke · 2019-07-19T00:28:41Z

torch/csrc/autograd/python_function.cpp

  auto f = (THPFunction*) obj;
  auto name = std::string(Py_TYPE(f)->tp_name);
-  THPObjectPtr _legacy(PyObject_GetAttrString(obj, "_is_legacy"));
+  THPObjectPtr _legacy(PyObject_GetAttrString((PyObject*)obj, "_is_legacy"));  // CONST?!


use a nicer cast?

apaszke · 2019-07-19T00:28:56Z

test/test_nn.py


+    @unittest.skipIf(not TEST_MULTIGPU, "multi-GPU not supported")
+    @skipIfRocm
+    def test_data_parallel_function_deletion(self):


Why is this related to this PR?

This tests the original bug that spawned this line of inquiry :)

apaszke · 2019-07-19T00:36:18Z

torch/csrc/autograd/python_function.cpp

+      "from the second invocation will clobber the saved tensors from the "
+      "first invocation.  Please consider rewriting your autograd function "
+      "in the modern style; for information on the new format, please see: "
+      "https://pytorch.org/docs/stable/notes/extending.html#extending-torch-autograd");


What if we use a legacy function once, free the backward, and try to apply it again? That wouldn't trigger this warning, so is that safe, or is it just another bug that is not covered by this PR?

Legacy is nuts so I don't really know for certain, but I think that it is technically safe. Because the backward is freed, there's nothing to clobber anymore.

We could tighten up the check by marking the legacy function as "consumed" once you apply it once (and error if you apply it again), but this is all best effort warning because we opted not to remove legacy functions this release cycle...

apaszke · 2019-07-19T00:42:06Z

Also, I would be ok if we simply deleted the legacy paths today. They have been deprecated since at least 0.3 and it's just a pain to maintain them. Also, IIRC JIT will blow up if you try to mix them.

ezyang · 2019-07-19T13:51:02Z

I am a bit worried about what would happen if someone was to hold on to the Python instance for too long (i.e. longer than the lifetime of the PyFunction from its use), and then try to access the attributes or reuse this. I think that would trigger your internal asserts, meaning they're not really that internal.

Yeah, @colesbury complained about this in #22998 so I'm probably going to have to move around data little so it is guaranteed to stay live. None of the current test cases exercise the codepath but I certainly believe it's broken.

ezyang · 2019-07-19T14:14:11Z

Also, I would be ok if we simply deleted the legacy paths today. They have been deprecated since at least 0.3 and it's just a pain to maintain them. Also, IIRC JIT will blow up if you try to mix them.

Yeah, me too XD. But IIUC they are somewhat widely used so we really should have an explicit deprecation notice.

Stack from [ghstack](https://github.com/ezyang/ghstack): * **#22983 Invert ownership between PyFunction and THPFunction.** Fixes #16532 and #14960. Let us introduce our dramatis personae: * THPFunction - a PyObject that represents a ctx object that is passed to our custom Python autograd functions. It contains a PyFunction * PyFunction - a subclass of Function which says how to compute the actual backward function. Note that this is a closure in the sense that it is associated with a ctx object which actually contains the saved tensors. It contains a non-owning pointer to THPFunction OK, do you see something wrong with this situation? The ownership here is totally backwards! It is PyFunction that should own the THPFunction, because a PyFunction is a closure that should own its context. To put it in other words, it is OK if a PyFunction gets freed before THPFunction, but not vice versa. This sounds nice, but making this happen in practice is a bit trickier, because there are cases where we do rely on ownership the wrong way. This patch simply "assumes" that ownership goes the correct direction in practice. The way I constructed it was I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from THPFunction to PyFunction so all existing code works. Essentially, this patch assumes that PyFunction stays live as long as you have a THPFunction: intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live. But as you can see from the presently skipped tests (specifically, test_hook_none), this is not always true. But it seems to be true for the code we care about, and that's enough for me! Some subtleties: - PyFunction is a C++ object that refers to a PyObject. This means it needs a custom deleter to handle deleting the PyObject, since you can't assume you have the GIL when it dies. - The old test_gc_in_destructor failed our internal assert because we never actually ran a backwards, and thus never actually materialized PyFunction. I'm chalking this up as "misuse of API" and rewrote the test to not have this problem. - Generalizing the issue above: this is BC-breaking for legacy autograd Functions in the following way: previously, the PyFunction for a THPFunction was generated immediately on construction of the Function `f` (which would eventually serve as the `grad_fn` of an object when you finally call `f(x)`.) This means that operations like `f.register_hook()` which previously worked prior to `f(x)` no longer work; you have to wait until you actually call the function to do registration. Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Stack from [ghstack](https://github.com/ezyang/ghstack): * **#22983 Invert ownership between PyFunction and THPFunction.** Fixes #16532 and #14960. Whence this fix? Let's introduce the dramatis personae: * THPFunction - a PyObject that represents a ctx object that is passed to our custom Python autograd functions. It contains a PyFunction * PyFunction - a subclass of Function which says how to compute the actual backward function. Note that this is a closure in the sense that it is associated with a ctx object which actually contains the saved tensors. It contains a non-owning pointer to THPFunction OK, do you see something wrong with this situation? The ownership here is totally backwards! It is PyFunction that should own the THPFunction, because a PyFunction is a closure that should own its context. To put it in other words, it is OK if a PyFunction gets freed before THPFunction, but not vice versa. This sounds nice, but making this happen in practice is a bit trickier, because sometimes we use THPFunction to represent the PyFunction. In particular, we represent a `grad_fn` object in the Python binding level as the THPFunction (I can only imagine we did this because we wanted the object identity of `grad_fn` to line up with the class the user provided). I did not have time to untangle this situation, so for this patch, in the cases where we're improperly treating a THPFunction as a PyFunction, we simply assume that the PyFunction is still live (it usually is), and yell at the user if it's not. I constructed this patch in a few steps: 1. First, I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from THPFunction to PyFunction so that I could have an escape hatch when the existing code improperly treated THPFunction as an owning reference to PyFunction. In some cases your code might break because of this, but most of the time the PyFunction is live and everything works out (intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live). 2. There are some cases where PyFunction is *provably* live at some point in time. In that case, don't bother extracting a strong reference from the weak pointer, just pass the strong pointer along directly! 3. For all the other cases where we can't prove it, write a long error message to help the user out when they run into that situation. Some subtleties: - PyFunction is a C++ object that refers to a PyObject. This means it needs a custom deleter to handle deleting the PyObject, since you can't assume you have the GIL when it dies. - The old test_gc_in_destructor failed our internal assert because we never actually ran a backwards, and thus never actually materialized PyFunction. I'm chalking this up as "misuse of API" and rewrote the test to not have this problem. - Generalizing the issue above: this is BC-breaking for legacy autograd Functions in the following way: previously, the PyFunction for a THPFunction was generated immediately on construction of the Function `f` (which would eventually serve as the `grad_fn` of an object when you finally call `f(x)`.) This means that operations like `f.register_hook()` which previously worked prior to `f(x)` no longer work; you have to wait until you actually call the function to do registration. I also removed dead Function::get_shared_ptr and Decref struct; these were previously needed for the crazy split refcounting scheme we had previously, but now are dead code. Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Fixes #16532 and #14960. This patch is a massive hack. The way I constructed it was I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from THPFunction to PyFunction so all existing code works. Essentially, this patch assumes that PyFunction stays live as long as you have a THPFunction: intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live. But as you can see from the presently skipped tests (specifically, test_hook_none), this is not always true. But it seems to be true for the code we care about, and that's enough for me! Some subtleties: - PyFunction is a C++ object that refers to a PyObject. This means it needs a custom deleter to handle deleting the PyObject, since you can't assume you have the GIL when it dies. - The old test_gc_in_destructor failed our internal assert because we never actually ran a backwards, and thus never actually materialized PyFunction. I'm chalking this up as "misuse of API" and rewrote the test to not have this problem. Signed-off-by: Edward Z. Yang <ezyang@fb.com> ghstack-source-id: 00cfcf5 Pull Request resolved: #22983

ezyang · 2019-07-19T20:00:50Z

I've resolved all of the code review comments.

Stack from [ghstack](https://github.com/ezyang/ghstack): * **#22983 Invert ownership between PyFunction and THPFunction.** Fixes #1653, #14960 and #18619. Whence this fix? Let's introduce the dramatis personae: * THPFunction - a PyObject that represents a ctx object that is passed to our custom Python autograd functions. It contains a PyFunction * PyFunction - a subclass of Function which says how to compute the actual backward function. Note that this is a closure in the sense that it is associated with a ctx object which actually contains the saved tensors. It contains a non-owning pointer to THPFunction OK, do you see something wrong with this situation? The ownership here is totally backwards! It is PyFunction that should own the THPFunction, because a PyFunction is a closure that should own its context. To put it in other words, it is OK if a PyFunction gets freed before THPFunction, but not vice versa. This sounds nice, but making this happen in practice is a bit trickier, because sometimes we use THPFunction to represent the PyFunction. In particular, we represent a `grad_fn` object in the Python binding level as the THPFunction (I can only imagine we did this because we wanted the object identity of `grad_fn` to line up with the class the user provided). I did not have time to untangle this situation, so for this patch, in the cases where we're improperly treating a THPFunction as a PyFunction, we simply assume that the PyFunction is still live (it usually is), and yell at the user if it's not. I constructed this patch in a few steps: 1. First, I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from THPFunction to PyFunction so that I could have an escape hatch when the existing code improperly treated THPFunction as an owning reference to PyFunction. In some cases your code might break because of this, but most of the time the PyFunction is live and everything works out (intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live). 2. There are some cases where PyFunction is *provably* live at some point in time. In that case, don't bother extracting a strong reference from the weak pointer, just pass the strong pointer along directly! 3. For all the other cases where we can't prove it, write a long error message to help the user out when they run into that situation. Some subtleties: - PyFunction is a C++ object that refers to a PyObject. This means it needs a custom deleter to handle deleting the PyObject, since you can't assume you have the GIL when it dies. - The old test_gc_in_destructor failed our internal assert because we never actually ran a backwards, and thus never actually materialized PyFunction. I'm chalking this up as "misuse of API" and rewrote the test to not have this problem. - Generalizing the issue above: this is BC-breaking for legacy autograd Functions in the following way: previously, the PyFunction for a THPFunction was generated immediately on construction of the Function `f` (which would eventually serve as the `grad_fn` of an object when you finally call `f(x)`.) This means that operations like `f.register_hook()` which previously worked prior to `f(x)` no longer work; you have to wait until you actually call the function to do registration. I also removed dead Function::get_shared_ptr and Decref struct; these were previously needed for the crazy split refcounting scheme we had previously, but now are dead code. Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Fixes #16532 and #14960. This patch is a massive hack. The way I constructed it was I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from THPFunction to PyFunction so all existing code works. Essentially, this patch assumes that PyFunction stays live as long as you have a THPFunction: intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live. But as you can see from the presently skipped tests (specifically, test_hook_none), this is not always true. But it seems to be true for the code we care about, and that's enough for me! Some subtleties: - PyFunction is a C++ object that refers to a PyObject. This means it needs a custom deleter to handle deleting the PyObject, since you can't assume you have the GIL when it dies. - The old test_gc_in_destructor failed our internal assert because we never actually ran a backwards, and thus never actually materialized PyFunction. I'm chalking this up as "misuse of API" and rewrote the test to not have this problem. Signed-off-by: Edward Z. Yang <ezyang@fb.com> ghstack-source-id: c515578 Pull Request resolved: #22983

colesbury

Looks good. Some minor comments below

colesbury · 2019-07-19T20:22:42Z

torch/csrc/autograd/python_function.cpp

+  // (otherwise the condition is trivially true).  Then there is a PyFunction
+  // which contains an owning reference to this object.  But we are only
+  // allowed to clear if all owning references are gone!  Contradiction.
+  TORCH_INTERNAL_ASSERT(!self->cdata.lock());


You can use self->cdata.expired() here

Also see https://cplusplus.github.io/LWG/lwg-active.html#2751. Note that THPFunction_clear is typically called during shared_ptr destructor. Seems likely that the desired behavior will be guaranteed in future revisions so probably OK.

Thanks that is a very nice point :)

colesbury · 2019-07-19T20:23:30Z

torch/csrc/autograd/python_function.cpp

-  self->cdata.set_next_edges(std::move(input_info.next_edges));
+  Py_INCREF(self);
+  std::shared_ptr<PyFunction> cdata;
+  if (cdata = self->cdata.lock()) {


The assignment is easy to misread here. Consider refactoring it (move the assignment to the definition of cdata) or adding an explicit comparison to make it more obvious.

colesbury · 2019-07-19T20:26:14Z

torch/csrc/autograd/python_function.cpp

  auto& input_info = info_pair.second;
  bool is_executable = input_info.is_executable;
-  self->cdata.set_next_edges(std::move(input_info.next_edges));
+  Py_INCREF(self);


What is this Py_INCREF(self) for?

It's so that I can give self as an owning reference to THPObjectPtr. Which means I'm leaking in the TORCH_WARN case. Better move it in!

colesbury · 2019-07-19T20:27:07Z

torch/csrc/autograd/python_function.cpp

  auto num_args = PyTuple_GET_SIZE(inputs);
  THPObjectPtr ctx_input_tuple(PyTuple_New(num_args + 1));
-  PyTuple_SET_ITEM(ctx_input_tuple.get(), 0, ctx_obj.release());
+  Py_INCREF(ctx);


This predates your change, but the code is missing a check that ctx_input_tuple is non-null (i.e. that the PyTuple_New succeeded)

Stack from [ghstack](https://github.com/ezyang/ghstack): * **#22983 Invert ownership between PyFunction and THPFunction.** Fixes #1653, #14960 and #18619. Whence this fix? Let's introduce the dramatis personae: * THPFunction - a PyObject that represents a ctx object that is passed to our custom Python autograd functions. It contains a PyFunction * PyFunction - a subclass of Function which says how to compute the actual backward function. Note that this is a closure in the sense that it is associated with a ctx object which actually contains the saved tensors. It contains a non-owning pointer to THPFunction OK, do you see something wrong with this situation? The ownership here is totally backwards! It is PyFunction that should own the THPFunction, because a PyFunction is a closure that should own its context. To put it in other words, it is OK if a PyFunction gets freed before THPFunction, but not vice versa. This sounds nice, but making this happen in practice is a bit trickier, because sometimes we use THPFunction to represent the PyFunction. In particular, we represent a `grad_fn` object in the Python binding level as the THPFunction (I can only imagine we did this because we wanted the object identity of `grad_fn` to line up with the class the user provided). I did not have time to untangle this situation, so for this patch, in the cases where we're improperly treating a THPFunction as a PyFunction, we simply assume that the PyFunction is still live (it usually is), and yell at the user if it's not. I constructed this patch in a few steps: 1. First, I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from THPFunction to PyFunction so that I could have an escape hatch when the existing code improperly treated THPFunction as an owning reference to PyFunction. In some cases your code might break because of this, but most of the time the PyFunction is live and everything works out (intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live). 2. There are some cases where PyFunction is *provably* live at some point in time. In that case, don't bother extracting a strong reference from the weak pointer, just pass the strong pointer along directly! 3. For all the other cases where we can't prove it, write a long error message to help the user out when they run into that situation. Some subtleties: - PyFunction is a C++ object that refers to a PyObject. This means it needs a custom deleter to handle deleting the PyObject, since you can't assume you have the GIL when it dies. - The old test_gc_in_destructor failed our internal assert because we never actually ran a backwards, and thus never actually materialized PyFunction. I'm chalking this up as "misuse of API" and rewrote the test to not have this problem. - Generalizing the issue above: this is BC-breaking for legacy autograd Functions in the following way: previously, the PyFunction for a THPFunction was generated immediately on construction of the Function `f` (which would eventually serve as the `grad_fn` of an object when you finally call `f(x)`.) This means that operations like `f.register_hook()` which previously worked prior to `f(x)` no longer work; you have to wait until you actually call the function to do registration. I also removed dead Function::get_shared_ptr and Decref struct; these were previously needed for the crazy split refcounting scheme we had previously, but now are dead code. Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Fixes #16532 and #14960. This patch is a massive hack. The way I constructed it was I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from THPFunction to PyFunction so all existing code works. Essentially, this patch assumes that PyFunction stays live as long as you have a THPFunction: intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live. But as you can see from the presently skipped tests (specifically, test_hook_none), this is not always true. But it seems to be true for the code we care about, and that's enough for me! Some subtleties: - PyFunction is a C++ object that refers to a PyObject. This means it needs a custom deleter to handle deleting the PyObject, since you can't assume you have the GIL when it dies. - The old test_gc_in_destructor failed our internal assert because we never actually ran a backwards, and thus never actually materialized PyFunction. I'm chalking this up as "misuse of API" and rewrote the test to not have this problem. Signed-off-by: Edward Z. Yang <ezyang@fb.com> ghstack-source-id: a25e840 Pull Request resolved: #22983

apaszke

A few minor comments, but LGTM

apaszke · 2019-07-22T14:55:49Z

torch/csrc/autograd/functions/pybind.h

 namespace pybind11 { namespace detail {

-// handle Python <-> torch::autograd::Function conversions
-template <> struct type_caster<std::shared_ptr<torch::autograd::Function>> {


Isn't this BC breaking for people who used to poke around .grad_fn?

Well, Python side grad_fn poking still works after this patch, so that didn't exercise this code. I think this specifically only affects people who were writing pybind11 bindings for autograd functions. I have the tiniest violin.

apaszke · 2019-07-22T14:56:17Z

torch/csrc/autograd/python_function.cpp

  auto f = (THPFunction*) obj;
  auto name = std::string(Py_TYPE(f)->tp_name);
-  THPObjectPtr _legacy(PyObject_GetAttrString(obj, "_is_legacy"));
+  // Python API functions are not const-correct


They cannot be, because attribute lookups can have arbitrary side effects 😄

apaszke · 2019-07-22T14:56:35Z

torch/csrc/autograd/python_function.cpp

-    if (auto pyhook = dynamic_cast<PyFunctionPreHook*>(hook.get())) {
-      Py_VISIT(pyhook->dict);
+  // cdata could be null if someone constructed a legacy function but haven't
+  // actually called backward() on it yet.


Or if the backward function went "out of scope"

apaszke · 2019-07-22T14:58:59Z

torch/csrc/autograd/python_function.cpp

-  // trigger deletion of other objects, and they can reference this function,
-  // seeing it in a half-deleted state.
-  auto pre_hooks = std::move(self->cdata.pre_hooks());
-  auto post_hooks = std::move(self->cdata.post_hooks());


I'm afraid that this was actually important, but 🤷🏻‍♂️

I think we're clear here. Proving that I don't need to do this is a big reason why we have the TORCH_INTERNAL_ASSERT(self->cdata.expired()); assert at the top of the function. The claim is that PyFunction will always destruct first, and delete all of the hook objects, BEFORE we start deletion of THPFunction.

apaszke · 2019-07-22T15:01:38Z

torch/csrc/autograd/python_function.cpp

    }
-    THPUtils_assert(PyTuple_GET_SIZE(raw_grad_output) == self->cdata.num_inputs(),
+    auto cdata = self->cdata.lock();
+    // In obscure situations, cdata might be nullptr because it's expired.  THAT


Uhhhh are you sure? If it's expired then how did we get to do_backward? Isn't it the PyFunction that calls this method?

You can also just call _do_backward straight up. So the order of operations is, you save out the grad_fn, free the variables, and then call grad_fn._do_backward. It's dumb, not worth making it work better, but we do have an error check if this does happen :)

apaszke · 2019-07-22T15:01:56Z

torch/csrc/autograd/python_function.cpp

+      "was called.  This could occur if you manually called _do_backward on Function.  "
+      "In any case, this is very naughty!  If you absolutely need this to work, "
+      "try porting your code to use non-legacy autograd function, see: "
+      "https://pytorch.org/docs/stable/notes/extending.html#extending-torch-autograd");


Oh lol, I think this is a bit paranoid, but w/e

apaszke · 2019-07-22T15:03:10Z

torch/csrc/autograd/python_function.cpp

-  self->cdata.add_pre_hook(std::move(hook));
+  auto cdata = self->cdata.lock();
+  TORCH_CHECK(cdata,
+    "Legacy autograd function had _register_hook_dict called before the function was "


This is a very bad error message to tell the user. Please figure out which methods call this one and write something that doesn't mention internal APIs. This gives completely no guidance about the potential fixes.

On the other hand it's legacy, so I guess I don't really care.

Well, it's easy enough to fix.

apaszke · 2019-07-22T15:04:37Z

torch/csrc/autograd/python_function.cpp

+  // This is really a true assert, because we've already tested for the
+  // self->has_freed_buffers case at the beginning of this function:
+  // buffers are freed when PyFunction dies; if the buffers are not freed,
+  // PyFunction must be live.


Hmmm are you sure? I think those are orthogonal conditions. has_freed_buffers is only triggered by running backward, but you can have cdata get freed with no backward runs involved.

IIUC, it's a superset condition. If PyFunction is destructed, has_freed_buffers is guaranteed to be true, but not vice versa. That's good enough here.

Stack from [ghstack](https://github.com/ezyang/ghstack): * **#22983 Invert ownership between PyFunction and THPFunction.** Fixes #1653, #14960 and #18619. Whence this fix? Let's introduce the dramatis personae: * THPFunction - a PyObject that represents a ctx object that is passed to our custom Python autograd functions. It contains a PyFunction * PyFunction - a subclass of Function which says how to compute the actual backward function. Note that this is a closure in the sense that it is associated with a ctx object which actually contains the saved tensors. It contains a non-owning pointer to THPFunction OK, do you see something wrong with this situation? The ownership here is totally backwards! It is PyFunction that should own the THPFunction, because a PyFunction is a closure that should own its context. To put it in other words, it is OK if a PyFunction gets freed before THPFunction, but not vice versa. This sounds nice, but making this happen in practice is a bit trickier, because sometimes we use THPFunction to represent the PyFunction. In particular, we represent a `grad_fn` object in the Python binding level as the THPFunction (I can only imagine we did this because we wanted the object identity of `grad_fn` to line up with the class the user provided). I did not have time to untangle this situation, so for this patch, in the cases where we're improperly treating a THPFunction as a PyFunction, we simply assume that the PyFunction is still live (it usually is), and yell at the user if it's not. I constructed this patch in a few steps: 1. First, I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from THPFunction to PyFunction so that I could have an escape hatch when the existing code improperly treated THPFunction as an owning reference to PyFunction. In some cases your code might break because of this, but most of the time the PyFunction is live and everything works out (intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live). 2. There are some cases where PyFunction is *provably* live at some point in time. In that case, don't bother extracting a strong reference from the weak pointer, just pass the strong pointer along directly! 3. For all the other cases where we can't prove it, write a long error message to help the user out when they run into that situation. Some subtleties: - PyFunction is a C++ object that refers to a PyObject. This means it needs a custom deleter to handle deleting the PyObject, since you can't assume you have the GIL when it dies. - The old test_gc_in_destructor failed our internal assert because we never actually ran a backwards, and thus never actually materialized PyFunction. I'm chalking this up as "misuse of API" and rewrote the test to not have this problem. - Generalizing the issue above: this is BC-breaking for legacy autograd Functions in the following way: previously, the PyFunction for a THPFunction was generated immediately on construction of the Function `f` (which would eventually serve as the `grad_fn` of an object when you finally call `f(x)`.) This means that operations like `f.register_hook()` which previously worked prior to `f(x)` no longer work; you have to wait until you actually call the function to do registration. I also removed dead Function::get_shared_ptr and Decref struct; these were previously needed for the crazy split refcounting scheme we had previously, but now are dead code. Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Fixes #16532 and #14960. This patch is a massive hack. The way I constructed it was I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from THPFunction to PyFunction so all existing code works. Essentially, this patch assumes that PyFunction stays live as long as you have a THPFunction: intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live. But as you can see from the presently skipped tests (specifically, test_hook_none), this is not always true. But it seems to be true for the code we care about, and that's enough for me! Some subtleties: - PyFunction is a C++ object that refers to a PyObject. This means it needs a custom deleter to handle deleting the PyObject, since you can't assume you have the GIL when it dies. - The old test_gc_in_destructor failed our internal assert because we never actually ran a backwards, and thus never actually materialized PyFunction. I'm chalking this up as "misuse of API" and rewrote the test to not have this problem. Signed-off-by: Edward Z. Yang <ezyang@fb.com> ghstack-source-id: 75917fb Pull Request resolved: #22983

facebook-github-bot · 2019-07-22T23:46:37Z

@ezyang merged this pull request in fdfc676.

pytorchbot added module: autograd Related to torch.autograd, and the autograd engine in general module: nn Related to torch.nn module: pybind Related to our Python bindings / interactions with other Python libraries labels Jul 17, 2019

ezyang requested review from apaszke, malvika2147 and mrshenli July 17, 2019 18:43

ezyang requested a review from albanD July 17, 2019 18:45

colesbury reviewed Jul 17, 2019

View reviewed changes

ezyang requested a review from colesbury July 17, 2019 19:52

mrshenli reviewed Jul 17, 2019

View reviewed changes

ezyang requested a review from yf225 July 17, 2019 21:44

This was referenced Jul 17, 2019

Remove dead Decref struct. #22996

Closed

Remove dead Function::get_shared_ptr. #22997

Closed

Avoid weak_ptr::lock() when shared_ptr is provably live; more error checking. #22998

Closed

yf225 reviewed Jul 18, 2019

View reviewed changes

apaszke reviewed Jul 19, 2019

View reviewed changes

ezyang requested review from apaszke, mrshenli and yf225 July 19, 2019 20:00

ezyang mentioned this pull request Jul 19, 2019

Fail in repeated evaluation of 2nd derivative of a custom autograd function #18619

Closed

pietern mentioned this pull request Jul 22, 2019

Check python REFCNT before deleting functions #16888

Closed

colesbury reviewed Jul 22, 2019

View reviewed changes

apaszke approved these changes Jul 22, 2019

View reviewed changes

facebook-github-bot closed this in fdfc676 Jul 22, 2019

zou3519 deleted the gh/ezyang/237/head branch July 22, 2019 21:16

facebook-github-bot added the merged label Jul 22, 2019

This was referenced Jul 23, 2019

v1.0.0 nn.utils.weight_norm seems to nullify gradients of unrelated parameters if wrapped in DataParallel #14960

Closed

Missing gradient when autograd called inside a function on Multi-GPU (eg gradient penalty) #16532

Closed

malvika2147 mentioned this pull request Jul 24, 2019

Remove torch::autograd::Node::get_shared_ptr() #23307

Closed

yf225 mentioned this pull request Aug 2, 2019

Add warning for legacy autograd function apply() #23719

Closed

mrshenli mentioned this pull request Aug 7, 2019

torch.nn.DataParallel causes incorrect gradients #23938

Open

L0SG mentioned this pull request Oct 10, 2019

PyTorch v1.0.0 multi-GPU compatibility issue ksw0306/FloWaveNet#13

Open

mruberry added the Merged label Oct 28, 2020

Invert ownership between PyFunction and THPFunction. #22983

Invert ownership between PyFunction and THPFunction. #22983

Uh oh!

Conversation

ezyang commented Jul 17, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ezyang commented Jul 17, 2019

Uh oh!

ezyang commented Jul 17, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mrshenli commented Jul 18, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

apaszke left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

apaszke commented Jul 19, 2019

Uh oh!

ezyang commented Jul 19, 2019

Uh oh!

ezyang commented Jul 19, 2019

Uh oh!

ezyang commented Jul 19, 2019

Uh oh!

colesbury left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

ezyang commented Jul 17, 2019 •

edited

Loading