Skip to content

Conversation

@ezyang
Copy link
Contributor

@ezyang ezyang commented Jul 17, 2019

Stack from ghstack:

Fixes #1653, #14960 and #18619.

Whence this fix? Let's introduce the dramatis personae:

  • THPFunction - a PyObject that represents a ctx object that is passed to our custom Python autograd functions. It contains a PyFunction
  • PyFunction - a subclass of Function which says how to compute the actual backward function. Note that this is a closure in the sense that it is associated with a ctx object which actually contains the saved tensors. It contains a non-owning pointer to THPFunction

OK, do you see something wrong with this situation? The ownership here is totally backwards! It is PyFunction that should own the THPFunction, because a PyFunction is a closure that should own its context. To put it in other words, it is OK if a PyFunction gets freed before THPFunction, but not vice versa.

This sounds nice, but making this happen in practice is a bit trickier, because sometimes we use THPFunction to represent the PyFunction. In particular, we represent a grad_fn object in the Python binding level as the THPFunction (I can only imagine we did this because we wanted the object identity of grad_fn to line up with the class the user provided). I did not have time to untangle this situation, so for this patch, in the cases where we're improperly treating a THPFunction as a PyFunction, we simply assume that the PyFunction is still live (it usually is), and yell at the user if it's not.

I constructed this patch in a few steps:

  1. First, I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from
    THPFunction to PyFunction so that I could have an escape hatch when the existing code improperly treated THPFunction as an owning reference to PyFunction. In some cases your code might break because of this, but most of the time the PyFunction is live and everything works out (intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live).
  2. There are some cases where PyFunction is provably live at some point in time. In that case, don't bother extracting a strong reference from the weak pointer, just pass the strong pointer along directly!
  3. For all the other cases where we can't prove it, write a long error message to help the user out when they run into that situation.

Some subtleties:

  • PyFunction is a C++ object that refers to a PyObject. This means it needs
    a custom deleter to handle deleting the PyObject, since you can't assume
    you have the GIL when it dies.
  • The old test_gc_in_destructor failed our internal assert because we never
    actually ran a backwards, and thus never actually materialized PyFunction.
    I'm chalking this up as "misuse of API" and rewrote the test to not have
    this problem.
  • Generalizing the issue above: this is BC-breaking for legacy autograd Functions
    in the following way: previously, the PyFunction for a THPFunction was generated
    immediately on construction of the Function f (which would eventually serve as
    the grad_fn of an object when you finally call f(x).) This means that
    operations like f.register_hook() which previously worked prior to f(x) no longer
    work; you have to wait until you actually call the function to do registration.

I also removed dead Function::get_shared_ptr and Decref struct; these were previously needed for the crazy split refcounting scheme we had previously, but now are dead code.

Signed-off-by: Edward Z. Yang ezyang@fb.com

Differential Revision: D16422209

Fixes #16532 and #14960.

This patch is a massive hack.  The way I constructed it was I flipped the
ownership between PyFunction and THPFunction, but maintained a weak pointer from
THPFunction to PyFunction so all existing code works.  Essentially, this patch
assumes that PyFunction stays live as long as you have a THPFunction: intuitively,
this makes sense, since the ctx object should only really stay live as long as
you're actually going to execute the backwards, which will keep the PyFunction
live.  But as you can see from the presently skipped tests (specifically,
test_hook_none), this is not always true.  But it seems to be true for the code
we care about, and that's enough for me!

Some subtleties:
- PyFunction is a C++ object that refers to a PyObject.  This means it needs
  a custom deleter to handle deleting the PyObject, since you can't assume
  you have the GIL when it dies.
- The old test_gc_in_destructor failed our internal assert because we never
  actually ran a backwards, and thus never actually materialized PyFunction.
  I'm chalking this up as "misuse of API" and rewrote the test to not have
  this problem.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
@pytorchbot pytorchbot added module: autograd Related to torch.autograd, and the autograd engine in general module: nn Related to torch.nn module: pybind Related to our Python bindings / interactions with other Python libraries labels Jul 17, 2019
@ezyang
Copy link
Contributor Author

ezyang commented Jul 17, 2019

Subsumes #16888

@ezyang ezyang requested a review from albanD July 17, 2019 18:45
@ezyang
Copy link
Contributor Author

ezyang commented Jul 17, 2019

There's some subsequent cleanup after this inversion that is possible, but I want to make sure this works first.

{
for (const auto& hook : self->cdata.pre_hooks()) {
auto cdata = self->cdata.lock();
TORCH_INTERNAL_ASSERT(cdata); // could do this optionally
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this guaranteed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not :) Here's what I have in my local copy:

 // Traverse and clear are required for supporting Python's GC cycle handling.
 static int THPFunction_traverse(THPFunction *self, visitproc visit, void *arg)
 {
-  for (const auto& hook : self->cdata.pre_hooks()) {
-    if (auto pyhook = dynamic_cast<PyFunctionPreHook*>(hook.get())) {
-      Py_VISIT(pyhook->dict);
+  auto cdata = self->cdata.lock();
+  // cdata could be null if someone constructed a legacy function but haven't
+  // actually called backward() on it yet.
+  if (cdata) {
+    for (const auto& hook : cdata->pre_hooks()) {
+      if (auto pyhook = dynamic_cast<PyFunctionPreHook*>(hook.get())) {
+        Py_VISIT(pyhook->dict);
+      }
     }
-  }
-  for (const auto& hook : self->cdata.post_hooks()) {
-    if (auto pyhook = dynamic_cast<PyFunctionPostHook*>(hook.get())) {
-      Py_VISIT(pyhook->dict);
+    for (const auto& hook : cdata->post_hooks()) {
+      if (auto pyhook = dynamic_cast<PyFunctionPostHook*>(hook.get())) {
+        Py_VISIT(pyhook->dict);
+      }
     }
   }

Waiting for CI to finish on the current iteration before I push though

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you going to remove the other TORCH_INTERNAL_ASSERT(cdata) calls? I don't think any of them are safe.

@ezyang ezyang requested a review from colesbury July 17, 2019 19:52
self->cdata.set_next_edges(std::move(input_info.next_edges));
Py_INCREF(self);
auto cdata = std::shared_ptr<PyFunction>(new PyFunction(THPObjectPtr((PyObject*)self)), deleteFunction);
self->cdata = cdata;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that this method is called multiple times on the same THPFunction, so that in some invocations self->cdata is already non-NULL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, in the legacy case it might be non-NULL. I actually have no idea what the "right" thing to do here is, because legacy autograd function behavior is nuts (all the better for @yf225 to delete them). Check this out:

macbook-pro-116:~ ezyang$ cat abb.py
import torch
from torch.autograd import Function
class Id(Function):
    def forward(self, x):
        self.save_for_backward(x)
        return x

    def backward(self, grad_x):
        x = self.saved_tensors
        print(x)
        return x

f = Id()
x1 = torch.zeros(1, requires_grad=True)
x2 = torch.ones(1, requires_grad=True)
print(f)
y = f(x1)
print(y.grad_fn)
z = f(x2)
print(z.grad_fn)
y.backward()

macbook-pro-116:~ ezyang$ python abb.py
<__main__.Id object at 0x116d58850>
<__main__.Id object at 0x116d58850>
<__main__.Id object at 0x116d58850>
(tensor([1.], requires_grad=True),)

The second call to f clobbers the saved tensors of the first call (in y), because f is literally the same grad_fn in both cases, even though this is totally wrong. I think the API is misused here, for a legacy Function, you should only call __apply__ on it once.

On reflection, maybe the right thing is to not allocate a new PyFunction if there already is one. I can make this adjustment, but really this program is stupid and we shouldn't support it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've modified legacy to warn when this happens.

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#22983 Invert ownership between PyFunction and THPFunction.**

Fixes #16532 and #14960.

This patch is a massive hack.  The way I constructed it was I flipped the
ownership between PyFunction and THPFunction, but maintained a weak pointer from
THPFunction to PyFunction so all existing code works.  Essentially, this patch
assumes that PyFunction stays live as long as you have a THPFunction: intuitively,
this makes sense, since the ctx object should only really stay live as long as
you're actually going to execute the backwards, which will keep the PyFunction
live.  But as you can see from the presently skipped tests (specifically,
test_hook_none), this is not always true.  But it seems to be true for the code
we care about, and that's enough for me!

Some subtleties:
- PyFunction is a C++ object that refers to a PyObject.  This means it needs
  a custom deleter to handle deleting the PyObject, since you can't assume
  you have the GIL when it dies.
- The old test_gc_in_destructor failed our internal assert because we never
  actually ran a backwards, and thus never actually materialized PyFunction.
  I'm chalking this up as "misuse of API" and rewrote the test to not have
  this problem.
- Generalizing the issue above: this is BC-breaking for legacy autograd Functions
  in the following way: previously, the PyFunction for a THPFunction was generated
  immediately on construction of the Function `f` (which would eventually serve as
  the `grad_fn` of an object when you finally call `f(x)`.) This means that
  operations like `f.register_hook()` which previously worked prior to `f(x)` no longer
  work; you have to wait until you actually call the function to do registration.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
ezyang added a commit that referenced this pull request Jul 17, 2019
Fixes #16532 and #14960.

This patch is a massive hack.  The way I constructed it was I flipped the
ownership between PyFunction and THPFunction, but maintained a weak pointer from
THPFunction to PyFunction so all existing code works.  Essentially, this patch
assumes that PyFunction stays live as long as you have a THPFunction: intuitively,
this makes sense, since the ctx object should only really stay live as long as
you're actually going to execute the backwards, which will keep the PyFunction
live.  But as you can see from the presently skipped tests (specifically,
test_hook_none), this is not always true.  But it seems to be true for the code
we care about, and that's enough for me!

Some subtleties:
- PyFunction is a C++ object that refers to a PyObject.  This means it needs
  a custom deleter to handle deleting the PyObject, since you can't assume
  you have the GIL when it dies.
- The old test_gc_in_destructor failed our internal assert because we never
  actually ran a backwards, and thus never actually materialized PyFunction.
  I'm chalking this up as "misuse of API" and rewrote the test to not have
  this problem.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: b00842b
Pull Request resolved: #22983
@ezyang ezyang requested a review from yf225 July 17, 2019 21:44
@mrshenli
Copy link
Contributor

@ezyang I don't see any additional concerns with this PR (except the one @colesbury mentioned above), but I am not confident to give it a stamp either. Shall we bring more people to take a look?

self.assertEqual(counter[0], 1, 'bw_hook not called')
self.assertEqual(x.grad.data, torch.ones(5, 5) * 2)

@unittest.skip("Currently broken, will be fixed in https://github.com/pytorch/pytorch/pull/22925")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#22925 is merged (yay!), feel free to rebase :)

with warnings.catch_warnings(record=True) as w:
z = f(x2)
self.assertIn('extending-torch-autograd', str(w[0].message))
y.backward()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we actually call z.backward() here, since we are actually checking whether x2.grad is populated correctly?

Also, should we add an assert showing that the clobbering happens to x1.grad when y.backward() is called after f is called twice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

y.backward() results in the clobber behavior. I'm not really interested in the functional correctness of clobbering (the behavior is nuts); I'm just mostly interested that we don't segfault in this situation.

Copy link
Contributor

@apaszke apaszke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally looks good and I think it makes sense (NB it would be good to link to the explanation you wrote in another PR, because the description here is not very self-contained).

I am a bit worried about what would happen if someone was to hold on to the Python instance for too long (i.e. longer than the lifetime of the PyFunction from its use), and then try to access the attributes or reuse this. I think that would trigger your internal asserts, meaning they're not really that internal.

// The actual PyFunction (in the autograd graph) that this data was
// saved for. This field may be NULL (because a user can construct
// a THPFunction directly from Python), but when this field is non-NULL,
// it is guaranteed that cdata.lock()->obj == this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we enforce that Python Function objects always have their corresponding PyFunctions by allocating one whenever you instantiate an object? That way we could avoid having a weak pointer, and could hold PyFunction* will the invariant that it would always be non-null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also seem to be using the assumption that .lock() succeeds quite liberally in this patch, so this comment is highly confusing for me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we enforce that Python Function objects always have their corresponding PyFunctions by allocating one whenever you instantiate an object?

There are a few layers to this question, so I want to address them separately.

In the non-legacy case, why can't we allocate a PyFunction in THPFunction_new so that THPFunction is always allocated with an associated PyFunction? The proximal cause is dumb: it's because THPFunction only holds a weak pointer to PyFunction, so if you allocate a PyFunction inside THPFunction_new, the only pointer that would be keeping it live is a weak pointer, and it would immediately go dead once THPFunction_new returns.

"But wait, Edward!" you might say, "Doesn't THPFunction_apply have this problem too?" No! We keep an owning reference to PyFunction for the entire body of this function, and eventually we will store an OWNING reference to PyFunction in the grad_fn of the Variables we return (in process_outputs); that owning reference will keep PyFunction live when we return. But there's no way to setup this owning reference in the constructor THPFunction_new, so we have to set it up somewhere else. It's a bit finely balanced, but in the non-legacy case, as long as you don't go constructing copies of THPFunction outside of apply (it is possible to do this, but not in the public API), we maintain the invariant that every THPFunction has a corresponding PyFunction.

In the legacy case, why can't we guarantee that there is an associated PyFunction? The first answer is, the legacy case is dumb and we shouldn't support it. But digging deeper, I think the root problem is that THPFunction (the ctx object) is what we expose to the world as grad_fn, when actually we should have exposed some sort of Python binding of PyFunction. So there is a confusion: we treat THPFunction as a Python binding as PyFunction, but we ALSO treat it as the ctx object we pass in when doing autograd. These two use cases are incompatible, but right now they're smooshed together in a single object.

The suggestion you've made here is treating THPFunction as just a Python binding for PyFunction. Which is reasonable, but a much more involved change.


Py_INCREF((PyObject*)self);
return std::shared_ptr<PyFunction>(&self->cdata, Decref());
return self->cdata.lock();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come we no longer need to do the incref/decref here? IIRC debug builds of Python report refcounting errors if we don't keep the C++/Python refcounts approximately in sync.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. I don't understand why it was previously needed. I guess I can build a debug build of Python and see if I still get refcounting errors. If I don't, would that be sufficient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a debug build of PyTorch with this patch and it has not reported any refcounting errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a debug build of PyTorch, a debug build of CPython. I once made it such that we always use a single shared_ptr to manage PyFunction, which has started triggering Python's GC asserts and the refcounting scheme has been updated by @colesbury accordingly. We should check that this patch does not suffer from the same issues as mine.

Also, there used to be this warning that you're not supposed to take weak pointers of PyFunctions because they can be managed by multiple shared_ptr blocks. That's no longer the case after this patch, right?

auto cdata = self->cdata.lock();
// cdata could be null if someone constructed a legacy function but haven't
// actually called backward() on it yet.
if (cdata) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: put the assignment into the if condition

auto f = (THPFunction*) obj;
auto name = std::string(Py_TYPE(f)->tp_name);
THPObjectPtr _legacy(PyObject_GetAttrString(obj, "_is_legacy"));
THPObjectPtr _legacy(PyObject_GetAttrString((PyObject*)obj, "_is_legacy")); // CONST?!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use a nicer cast?


@unittest.skipIf(not TEST_MULTIGPU, "multi-GPU not supported")
@skipIfRocm
def test_data_parallel_function_deletion(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this related to this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tests the original bug that spawned this line of inquiry :)

"from the second invocation will clobber the saved tensors from the "
"first invocation. Please consider rewriting your autograd function "
"in the modern style; for information on the new format, please see: "
"https://pytorch.org/docs/stable/notes/extending.html#extending-torch-autograd");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we use a legacy function once, free the backward, and try to apply it again? That wouldn't trigger this warning, so is that safe, or is it just another bug that is not covered by this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Legacy is nuts so I don't really know for certain, but I think that it is technically safe. Because the backward is freed, there's nothing to clobber anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could tighten up the check by marking the legacy function as "consumed" once you apply it once (and error if you apply it again), but this is all best effort warning because we opted not to remove legacy functions this release cycle...

@apaszke
Copy link
Contributor

apaszke commented Jul 19, 2019

Also, I would be ok if we simply deleted the legacy paths today. They have been deprecated since at least 0.3 and it's just a pain to maintain them. Also, IIRC JIT will blow up if you try to mix them.

@ezyang
Copy link
Contributor Author

ezyang commented Jul 19, 2019

I am a bit worried about what would happen if someone was to hold on to the Python instance for too long (i.e. longer than the lifetime of the PyFunction from its use), and then try to access the attributes or reuse this. I think that would trigger your internal asserts, meaning they're not really that internal.

Yeah, @colesbury complained about this in #22998 so I'm probably going to have to move around data little so it is guaranteed to stay live. None of the current test cases exercise the codepath but I certainly believe it's broken.

@ezyang
Copy link
Contributor Author

ezyang commented Jul 19, 2019

Also, I would be ok if we simply deleted the legacy paths today. They have been deprecated since at least 0.3 and it's just a pain to maintain them. Also, IIRC JIT will blow up if you try to mix them.

Yeah, me too XD. But IIUC they are somewhat widely used so we really should have an explicit deprecation notice.

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#22983 Invert ownership between PyFunction and THPFunction.**

Fixes #16532 and #14960.

Let us introduce our dramatis personae:

* THPFunction - a PyObject that represents a ctx object that is passed to our custom Python autograd functions. It contains a PyFunction
* PyFunction - a subclass of Function which says how to compute the actual backward function. Note that this is a closure in the sense that it is associated with a ctx object which actually contains the saved tensors. It contains a non-owning pointer to THPFunction

OK, do you see something wrong with this situation? The ownership here is totally backwards! It is PyFunction that should own the THPFunction, because a PyFunction is a closure that should own its context. To put it in other words, it is OK if a PyFunction gets freed before THPFunction, but not vice versa.

This sounds nice, but making this happen in practice is a bit trickier, because there are cases where we do rely on ownership the wrong way.  This patch simply "assumes" that ownership goes the correct direction in practice.  The way I constructed it was I flipped the
ownership between PyFunction and THPFunction, but maintained a weak pointer from
THPFunction to PyFunction so all existing code works.  Essentially, this patch
assumes that PyFunction stays live as long as you have a THPFunction: intuitively,
this makes sense, since the ctx object should only really stay live as long as
you're actually going to execute the backwards, which will keep the PyFunction
live.  But as you can see from the presently skipped tests (specifically,
test_hook_none), this is not always true.  But it seems to be true for the code
we care about, and that's enough for me!

Some subtleties:
- PyFunction is a C++ object that refers to a PyObject.  This means it needs
  a custom deleter to handle deleting the PyObject, since you can't assume
  you have the GIL when it dies.
- The old test_gc_in_destructor failed our internal assert because we never
  actually ran a backwards, and thus never actually materialized PyFunction.
  I'm chalking this up as "misuse of API" and rewrote the test to not have
  this problem.
- Generalizing the issue above: this is BC-breaking for legacy autograd Functions
  in the following way: previously, the PyFunction for a THPFunction was generated
  immediately on construction of the Function `f` (which would eventually serve as
  the `grad_fn` of an object when you finally call `f(x)`.) This means that
  operations like `f.register_hook()` which previously worked prior to `f(x)` no longer
  work; you have to wait until you actually call the function to do registration.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#22983 Invert ownership between PyFunction and THPFunction.**

Fixes #16532 and #14960.

Whence this fix? Let's introduce the dramatis personae:

* THPFunction - a PyObject that represents a ctx object that is passed to our custom Python autograd functions. It contains a PyFunction
* PyFunction - a subclass of Function which says how to compute the actual backward function. Note that this is a closure in the sense that it is associated with a ctx object which actually contains the saved tensors. It contains a non-owning pointer to THPFunction

OK, do you see something wrong with this situation? The ownership here is totally backwards! It is PyFunction that should own the THPFunction, because a PyFunction is a closure that should own its context. To put it in other words, it is OK if a PyFunction gets freed before THPFunction, but not vice versa.

This sounds nice, but making this happen in practice is a bit trickier, because sometimes we use THPFunction to represent the PyFunction. In particular, we represent a `grad_fn` object in the Python binding level as the THPFunction (I can only imagine we did this because we wanted the object identity of `grad_fn` to line up with the class the user provided).  I did not have time to untangle this situation, so for this patch, in the cases where we're improperly treating a THPFunction as a PyFunction, we simply assume that the PyFunction is still live (it usually is), and yell at the user if it's not.

I constructed this patch in a few steps:
1. First, I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from
THPFunction to PyFunction so that I could have an escape hatch when the existing code improperly treated THPFunction as an owning reference to PyFunction. In some cases your code might break because of this, but most of the time the PyFunction is live and everything works out (intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live).
2. There are some cases where PyFunction is *provably* live at some point in time. In that case, don't bother extracting a strong reference from the weak pointer, just pass the strong pointer along directly!
3. For all the other cases where we can't prove it, write a long error message to help the user out when they run into that situation.

Some subtleties:
- PyFunction is a C++ object that refers to a PyObject.  This means it needs
  a custom deleter to handle deleting the PyObject, since you can't assume
  you have the GIL when it dies.
- The old test_gc_in_destructor failed our internal assert because we never
  actually ran a backwards, and thus never actually materialized PyFunction.
  I'm chalking this up as "misuse of API" and rewrote the test to not have
  this problem.
- Generalizing the issue above: this is BC-breaking for legacy autograd Functions
  in the following way: previously, the PyFunction for a THPFunction was generated
  immediately on construction of the Function `f` (which would eventually serve as
  the `grad_fn` of an object when you finally call `f(x)`.) This means that
  operations like `f.register_hook()` which previously worked prior to `f(x)` no longer
  work; you have to wait until you actually call the function to do registration.

I also removed dead Function::get_shared_ptr and Decref struct; these were previously needed for the crazy split refcounting scheme we had previously, but now are dead code.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#22983 Invert ownership between PyFunction and THPFunction.**

Fixes #16532 and #14960.

Whence this fix? Let's introduce the dramatis personae:

* THPFunction - a PyObject that represents a ctx object that is passed to our custom Python autograd functions. It contains a PyFunction
* PyFunction - a subclass of Function which says how to compute the actual backward function. Note that this is a closure in the sense that it is associated with a ctx object which actually contains the saved tensors. It contains a non-owning pointer to THPFunction

OK, do you see something wrong with this situation? The ownership here is totally backwards! It is PyFunction that should own the THPFunction, because a PyFunction is a closure that should own its context. To put it in other words, it is OK if a PyFunction gets freed before THPFunction, but not vice versa.

This sounds nice, but making this happen in practice is a bit trickier, because sometimes we use THPFunction to represent the PyFunction. In particular, we represent a `grad_fn` object in the Python binding level as the THPFunction (I can only imagine we did this because we wanted the object identity of `grad_fn` to line up with the class the user provided).  I did not have time to untangle this situation, so for this patch, in the cases where we're improperly treating a THPFunction as a PyFunction, we simply assume that the PyFunction is still live (it usually is), and yell at the user if it's not.

I constructed this patch in a few steps:
1. First, I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from
THPFunction to PyFunction so that I could have an escape hatch when the existing code improperly treated THPFunction as an owning reference to PyFunction. In some cases your code might break because of this, but most of the time the PyFunction is live and everything works out (intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live).
2. There are some cases where PyFunction is *provably* live at some point in time. In that case, don't bother extracting a strong reference from the weak pointer, just pass the strong pointer along directly!
3. For all the other cases where we can't prove it, write a long error message to help the user out when they run into that situation.

Some subtleties:
- PyFunction is a C++ object that refers to a PyObject.  This means it needs
  a custom deleter to handle deleting the PyObject, since you can't assume
  you have the GIL when it dies.
- The old test_gc_in_destructor failed our internal assert because we never
  actually ran a backwards, and thus never actually materialized PyFunction.
  I'm chalking this up as "misuse of API" and rewrote the test to not have
  this problem.
- Generalizing the issue above: this is BC-breaking for legacy autograd Functions
  in the following way: previously, the PyFunction for a THPFunction was generated
  immediately on construction of the Function `f` (which would eventually serve as
  the `grad_fn` of an object when you finally call `f(x)`.) This means that
  operations like `f.register_hook()` which previously worked prior to `f(x)` no longer
  work; you have to wait until you actually call the function to do registration.

I also removed dead Function::get_shared_ptr and Decref struct; these were previously needed for the crazy split refcounting scheme we had previously, but now are dead code.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#22983 Invert ownership between PyFunction and THPFunction.**

Fixes #16532 and #14960.

Whence this fix? Let's introduce the dramatis personae:

* THPFunction - a PyObject that represents a ctx object that is passed to our custom Python autograd functions. It contains a PyFunction
* PyFunction - a subclass of Function which says how to compute the actual backward function. Note that this is a closure in the sense that it is associated with a ctx object which actually contains the saved tensors. It contains a non-owning pointer to THPFunction

OK, do you see something wrong with this situation? The ownership here is totally backwards! It is PyFunction that should own the THPFunction, because a PyFunction is a closure that should own its context. To put it in other words, it is OK if a PyFunction gets freed before THPFunction, but not vice versa.

This sounds nice, but making this happen in practice is a bit trickier, because sometimes we use THPFunction to represent the PyFunction. In particular, we represent a `grad_fn` object in the Python binding level as the THPFunction (I can only imagine we did this because we wanted the object identity of `grad_fn` to line up with the class the user provided).  I did not have time to untangle this situation, so for this patch, in the cases where we're improperly treating a THPFunction as a PyFunction, we simply assume that the PyFunction is still live (it usually is), and yell at the user if it's not.

I constructed this patch in a few steps:
1. First, I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from
THPFunction to PyFunction so that I could have an escape hatch when the existing code improperly treated THPFunction as an owning reference to PyFunction. In some cases your code might break because of this, but most of the time the PyFunction is live and everything works out (intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live).
2. There are some cases where PyFunction is *provably* live at some point in time. In that case, don't bother extracting a strong reference from the weak pointer, just pass the strong pointer along directly!
3. For all the other cases where we can't prove it, write a long error message to help the user out when they run into that situation.

Some subtleties:
- PyFunction is a C++ object that refers to a PyObject.  This means it needs
  a custom deleter to handle deleting the PyObject, since you can't assume
  you have the GIL when it dies.
- The old test_gc_in_destructor failed our internal assert because we never
  actually ran a backwards, and thus never actually materialized PyFunction.
  I'm chalking this up as "misuse of API" and rewrote the test to not have
  this problem.
- Generalizing the issue above: this is BC-breaking for legacy autograd Functions
  in the following way: previously, the PyFunction for a THPFunction was generated
  immediately on construction of the Function `f` (which would eventually serve as
  the `grad_fn` of an object when you finally call `f(x)`.) This means that
  operations like `f.register_hook()` which previously worked prior to `f(x)` no longer
  work; you have to wait until you actually call the function to do registration.

I also removed dead Function::get_shared_ptr and Decref struct; these were previously needed for the crazy split refcounting scheme we had previously, but now are dead code.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
ezyang added a commit that referenced this pull request Jul 19, 2019
Fixes #16532 and #14960.

This patch is a massive hack.  The way I constructed it was I flipped the
ownership between PyFunction and THPFunction, but maintained a weak pointer from
THPFunction to PyFunction so all existing code works.  Essentially, this patch
assumes that PyFunction stays live as long as you have a THPFunction: intuitively,
this makes sense, since the ctx object should only really stay live as long as
you're actually going to execute the backwards, which will keep the PyFunction
live.  But as you can see from the presently skipped tests (specifically,
test_hook_none), this is not always true.  But it seems to be true for the code
we care about, and that's enough for me!

Some subtleties:
- PyFunction is a C++ object that refers to a PyObject.  This means it needs
  a custom deleter to handle deleting the PyObject, since you can't assume
  you have the GIL when it dies.
- The old test_gc_in_destructor failed our internal assert because we never
  actually ran a backwards, and thus never actually materialized PyFunction.
  I'm chalking this up as "misuse of API" and rewrote the test to not have
  this problem.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: 00cfcf5
Pull Request resolved: #22983
@ezyang ezyang requested review from apaszke, mrshenli and yf225 July 19, 2019 20:00
@ezyang
Copy link
Contributor Author

ezyang commented Jul 19, 2019

I've resolved all of the code review comments.

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#22983 Invert ownership between PyFunction and THPFunction.**

Fixes #1653, #14960 and #18619.

Whence this fix? Let's introduce the dramatis personae:

* THPFunction - a PyObject that represents a ctx object that is passed to our custom Python autograd functions. It contains a PyFunction
* PyFunction - a subclass of Function which says how to compute the actual backward function. Note that this is a closure in the sense that it is associated with a ctx object which actually contains the saved tensors. It contains a non-owning pointer to THPFunction

OK, do you see something wrong with this situation? The ownership here is totally backwards! It is PyFunction that should own the THPFunction, because a PyFunction is a closure that should own its context. To put it in other words, it is OK if a PyFunction gets freed before THPFunction, but not vice versa.

This sounds nice, but making this happen in practice is a bit trickier, because sometimes we use THPFunction to represent the PyFunction. In particular, we represent a `grad_fn` object in the Python binding level as the THPFunction (I can only imagine we did this because we wanted the object identity of `grad_fn` to line up with the class the user provided).  I did not have time to untangle this situation, so for this patch, in the cases where we're improperly treating a THPFunction as a PyFunction, we simply assume that the PyFunction is still live (it usually is), and yell at the user if it's not.

I constructed this patch in a few steps:
1. First, I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from
THPFunction to PyFunction so that I could have an escape hatch when the existing code improperly treated THPFunction as an owning reference to PyFunction. In some cases your code might break because of this, but most of the time the PyFunction is live and everything works out (intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live).
2. There are some cases where PyFunction is *provably* live at some point in time. In that case, don't bother extracting a strong reference from the weak pointer, just pass the strong pointer along directly!
3. For all the other cases where we can't prove it, write a long error message to help the user out when they run into that situation.

Some subtleties:
- PyFunction is a C++ object that refers to a PyObject.  This means it needs
  a custom deleter to handle deleting the PyObject, since you can't assume
  you have the GIL when it dies.
- The old test_gc_in_destructor failed our internal assert because we never
  actually ran a backwards, and thus never actually materialized PyFunction.
  I'm chalking this up as "misuse of API" and rewrote the test to not have
  this problem.
- Generalizing the issue above: this is BC-breaking for legacy autograd Functions
  in the following way: previously, the PyFunction for a THPFunction was generated
  immediately on construction of the Function `f` (which would eventually serve as
  the `grad_fn` of an object when you finally call `f(x)`.) This means that
  operations like `f.register_hook()` which previously worked prior to `f(x)` no longer
  work; you have to wait until you actually call the function to do registration.

I also removed dead Function::get_shared_ptr and Decref struct; these were previously needed for the crazy split refcounting scheme we had previously, but now are dead code.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
ezyang added a commit that referenced this pull request Jul 19, 2019
Fixes #16532 and #14960.

This patch is a massive hack.  The way I constructed it was I flipped the
ownership between PyFunction and THPFunction, but maintained a weak pointer from
THPFunction to PyFunction so all existing code works.  Essentially, this patch
assumes that PyFunction stays live as long as you have a THPFunction: intuitively,
this makes sense, since the ctx object should only really stay live as long as
you're actually going to execute the backwards, which will keep the PyFunction
live.  But as you can see from the presently skipped tests (specifically,
test_hook_none), this is not always true.  But it seems to be true for the code
we care about, and that's enough for me!

Some subtleties:
- PyFunction is a C++ object that refers to a PyObject.  This means it needs
  a custom deleter to handle deleting the PyObject, since you can't assume
  you have the GIL when it dies.
- The old test_gc_in_destructor failed our internal assert because we never
  actually ran a backwards, and thus never actually materialized PyFunction.
  I'm chalking this up as "misuse of API" and rewrote the test to not have
  this problem.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: c515578
Pull Request resolved: #22983
Copy link
Member

@colesbury colesbury left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Some minor comments below

// (otherwise the condition is trivially true). Then there is a PyFunction
// which contains an owning reference to this object. But we are only
// allowed to clear if all owning references are gone! Contradiction.
TORCH_INTERNAL_ASSERT(!self->cdata.lock());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use self->cdata.expired() here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also see https://cplusplus.github.io/LWG/lwg-active.html#2751. Note that THPFunction_clear is typically called during shared_ptr destructor. Seems likely that the desired behavior will be guaranteed in future revisions so probably OK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks that is a very nice point :)

self->cdata.set_next_edges(std::move(input_info.next_edges));
Py_INCREF(self);
std::shared_ptr<PyFunction> cdata;
if (cdata = self->cdata.lock()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assignment is easy to misread here. Consider refactoring it (move the assignment to the definition of cdata) or adding an explicit comparison to make it more obvious.

auto& input_info = info_pair.second;
bool is_executable = input_info.is_executable;
self->cdata.set_next_edges(std::move(input_info.next_edges));
Py_INCREF(self);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this Py_INCREF(self) for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's so that I can give self as an owning reference to THPObjectPtr. Which means I'm leaking in the TORCH_WARN case. Better move it in!

auto num_args = PyTuple_GET_SIZE(inputs);
THPObjectPtr ctx_input_tuple(PyTuple_New(num_args + 1));
PyTuple_SET_ITEM(ctx_input_tuple.get(), 0, ctx_obj.release());
Py_INCREF(ctx);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This predates your change, but the code is missing a check that ctx_input_tuple is non-null (i.e. that the PyTuple_New succeeded)

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#22983 Invert ownership between PyFunction and THPFunction.**

Fixes #1653, #14960 and #18619.

Whence this fix? Let's introduce the dramatis personae:

* THPFunction - a PyObject that represents a ctx object that is passed to our custom Python autograd functions. It contains a PyFunction
* PyFunction - a subclass of Function which says how to compute the actual backward function. Note that this is a closure in the sense that it is associated with a ctx object which actually contains the saved tensors. It contains a non-owning pointer to THPFunction

OK, do you see something wrong with this situation? The ownership here is totally backwards! It is PyFunction that should own the THPFunction, because a PyFunction is a closure that should own its context. To put it in other words, it is OK if a PyFunction gets freed before THPFunction, but not vice versa.

This sounds nice, but making this happen in practice is a bit trickier, because sometimes we use THPFunction to represent the PyFunction. In particular, we represent a `grad_fn` object in the Python binding level as the THPFunction (I can only imagine we did this because we wanted the object identity of `grad_fn` to line up with the class the user provided).  I did not have time to untangle this situation, so for this patch, in the cases where we're improperly treating a THPFunction as a PyFunction, we simply assume that the PyFunction is still live (it usually is), and yell at the user if it's not.

I constructed this patch in a few steps:
1. First, I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from
THPFunction to PyFunction so that I could have an escape hatch when the existing code improperly treated THPFunction as an owning reference to PyFunction. In some cases your code might break because of this, but most of the time the PyFunction is live and everything works out (intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live).
2. There are some cases where PyFunction is *provably* live at some point in time. In that case, don't bother extracting a strong reference from the weak pointer, just pass the strong pointer along directly!
3. For all the other cases where we can't prove it, write a long error message to help the user out when they run into that situation.

Some subtleties:
- PyFunction is a C++ object that refers to a PyObject.  This means it needs
  a custom deleter to handle deleting the PyObject, since you can't assume
  you have the GIL when it dies.
- The old test_gc_in_destructor failed our internal assert because we never
  actually ran a backwards, and thus never actually materialized PyFunction.
  I'm chalking this up as "misuse of API" and rewrote the test to not have
  this problem.
- Generalizing the issue above: this is BC-breaking for legacy autograd Functions
  in the following way: previously, the PyFunction for a THPFunction was generated
  immediately on construction of the Function `f` (which would eventually serve as
  the `grad_fn` of an object when you finally call `f(x)`.) This means that
  operations like `f.register_hook()` which previously worked prior to `f(x)` no longer
  work; you have to wait until you actually call the function to do registration.

I also removed dead Function::get_shared_ptr and Decref struct; these were previously needed for the crazy split refcounting scheme we had previously, but now are dead code.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
ezyang added a commit that referenced this pull request Jul 22, 2019
Fixes #16532 and #14960.

This patch is a massive hack.  The way I constructed it was I flipped the
ownership between PyFunction and THPFunction, but maintained a weak pointer from
THPFunction to PyFunction so all existing code works.  Essentially, this patch
assumes that PyFunction stays live as long as you have a THPFunction: intuitively,
this makes sense, since the ctx object should only really stay live as long as
you're actually going to execute the backwards, which will keep the PyFunction
live.  But as you can see from the presently skipped tests (specifically,
test_hook_none), this is not always true.  But it seems to be true for the code
we care about, and that's enough for me!

Some subtleties:
- PyFunction is a C++ object that refers to a PyObject.  This means it needs
  a custom deleter to handle deleting the PyObject, since you can't assume
  you have the GIL when it dies.
- The old test_gc_in_destructor failed our internal assert because we never
  actually ran a backwards, and thus never actually materialized PyFunction.
  I'm chalking this up as "misuse of API" and rewrote the test to not have
  this problem.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: a25e840
Pull Request resolved: #22983
Copy link
Contributor

@apaszke apaszke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments, but LGTM

namespace pybind11 { namespace detail {

// handle Python <-> torch::autograd::Function conversions
template <> struct type_caster<std::shared_ptr<torch::autograd::Function>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this BC breaking for people who used to poke around .grad_fn?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, Python side grad_fn poking still works after this patch, so that didn't exercise this code. I think this specifically only affects people who were writing pybind11 bindings for autograd functions. I have the tiniest violin.

auto f = (THPFunction*) obj;
auto name = std::string(Py_TYPE(f)->tp_name);
THPObjectPtr _legacy(PyObject_GetAttrString(obj, "_is_legacy"));
// Python API functions are not const-correct
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They cannot be, because attribute lookups can have arbitrary side effects 😄

if (auto pyhook = dynamic_cast<PyFunctionPreHook*>(hook.get())) {
Py_VISIT(pyhook->dict);
// cdata could be null if someone constructed a legacy function but haven't
// actually called backward() on it yet.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or if the backward function went "out of scope"

// trigger deletion of other objects, and they can reference this function,
// seeing it in a half-deleted state.
auto pre_hooks = std::move(self->cdata.pre_hooks());
auto post_hooks = std::move(self->cdata.post_hooks());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that this was actually important, but 🤷🏻‍♂️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're clear here. Proving that I don't need to do this is a big reason why we have the TORCH_INTERNAL_ASSERT(self->cdata.expired()); assert at the top of the function. The claim is that PyFunction will always destruct first, and delete all of the hook objects, BEFORE we start deletion of THPFunction.

}
THPUtils_assert(PyTuple_GET_SIZE(raw_grad_output) == self->cdata.num_inputs(),
auto cdata = self->cdata.lock();
// In obscure situations, cdata might be nullptr because it's expired. THAT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhhhh are you sure? If it's expired then how did we get to do_backward? Isn't it the PyFunction that calls this method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also just call _do_backward straight up. So the order of operations is, you save out the grad_fn, free the variables, and then call grad_fn._do_backward. It's dumb, not worth making it work better, but we do have an error check if this does happen :)

"was called. This could occur if you manually called _do_backward on Function. "
"In any case, this is very naughty! If you absolutely need this to work, "
"try porting your code to use non-legacy autograd function, see: "
"https://pytorch.org/docs/stable/notes/extending.html#extending-torch-autograd");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh lol, I think this is a bit paranoid, but w/e

self->cdata.add_pre_hook(std::move(hook));
auto cdata = self->cdata.lock();
TORCH_CHECK(cdata,
"Legacy autograd function had _register_hook_dict called before the function was "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very bad error message to tell the user. Please figure out which methods call this one and write something that doesn't mention internal APIs. This gives completely no guidance about the potential fixes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand it's legacy, so I guess I don't really care.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it's easy enough to fix.

// This is really a true assert, because we've already tested for the
// self->has_freed_buffers case at the beginning of this function:
// buffers are freed when PyFunction dies; if the buffers are not freed,
// PyFunction must be live.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm are you sure? I think those are orthogonal conditions. has_freed_buffers is only triggered by running backward, but you can have cdata get freed with no backward runs involved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, it's a superset condition. If PyFunction is destructed, has_freed_buffers is guaranteed to be true, but not vice versa. That's good enough here.

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#22983 Invert ownership between PyFunction and THPFunction.**

Fixes #1653, #14960 and #18619.

Whence this fix? Let's introduce the dramatis personae:

* THPFunction - a PyObject that represents a ctx object that is passed to our custom Python autograd functions. It contains a PyFunction
* PyFunction - a subclass of Function which says how to compute the actual backward function. Note that this is a closure in the sense that it is associated with a ctx object which actually contains the saved tensors. It contains a non-owning pointer to THPFunction

OK, do you see something wrong with this situation? The ownership here is totally backwards! It is PyFunction that should own the THPFunction, because a PyFunction is a closure that should own its context. To put it in other words, it is OK if a PyFunction gets freed before THPFunction, but not vice versa.

This sounds nice, but making this happen in practice is a bit trickier, because sometimes we use THPFunction to represent the PyFunction. In particular, we represent a `grad_fn` object in the Python binding level as the THPFunction (I can only imagine we did this because we wanted the object identity of `grad_fn` to line up with the class the user provided).  I did not have time to untangle this situation, so for this patch, in the cases where we're improperly treating a THPFunction as a PyFunction, we simply assume that the PyFunction is still live (it usually is), and yell at the user if it's not.

I constructed this patch in a few steps:
1. First, I flipped the ownership between PyFunction and THPFunction, but maintained a weak pointer from
THPFunction to PyFunction so that I could have an escape hatch when the existing code improperly treated THPFunction as an owning reference to PyFunction. In some cases your code might break because of this, but most of the time the PyFunction is live and everything works out (intuitively, this makes sense, since the ctx object should only really stay live as long as you're actually going to execute the backwards, which will keep the PyFunction live).
2. There are some cases where PyFunction is *provably* live at some point in time. In that case, don't bother extracting a strong reference from the weak pointer, just pass the strong pointer along directly!
3. For all the other cases where we can't prove it, write a long error message to help the user out when they run into that situation.

Some subtleties:
- PyFunction is a C++ object that refers to a PyObject.  This means it needs
  a custom deleter to handle deleting the PyObject, since you can't assume
  you have the GIL when it dies.
- The old test_gc_in_destructor failed our internal assert because we never
  actually ran a backwards, and thus never actually materialized PyFunction.
  I'm chalking this up as "misuse of API" and rewrote the test to not have
  this problem.
- Generalizing the issue above: this is BC-breaking for legacy autograd Functions
  in the following way: previously, the PyFunction for a THPFunction was generated
  immediately on construction of the Function `f` (which would eventually serve as
  the `grad_fn` of an object when you finally call `f(x)`.) This means that
  operations like `f.register_hook()` which previously worked prior to `f(x)` no longer
  work; you have to wait until you actually call the function to do registration.

I also removed dead Function::get_shared_ptr and Decref struct; these were previously needed for the crazy split refcounting scheme we had previously, but now are dead code.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
ezyang added a commit that referenced this pull request Jul 22, 2019
Fixes #16532 and #14960.

This patch is a massive hack.  The way I constructed it was I flipped the
ownership between PyFunction and THPFunction, but maintained a weak pointer from
THPFunction to PyFunction so all existing code works.  Essentially, this patch
assumes that PyFunction stays live as long as you have a THPFunction: intuitively,
this makes sense, since the ctx object should only really stay live as long as
you're actually going to execute the backwards, which will keep the PyFunction
live.  But as you can see from the presently skipped tests (specifically,
test_hook_none), this is not always true.  But it seems to be true for the code
we care about, and that's enough for me!

Some subtleties:
- PyFunction is a C++ object that refers to a PyObject.  This means it needs
  a custom deleter to handle deleting the PyObject, since you can't assume
  you have the GIL when it dies.
- The old test_gc_in_destructor failed our internal assert because we never
  actually ran a backwards, and thus never actually materialized PyFunction.
  I'm chalking this up as "misuse of API" and rewrote the test to not have
  this problem.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: 75917fb
Pull Request resolved: #22983
@zou3519 zou3519 deleted the gh/ezyang/237/head branch July 22, 2019 21:16
@facebook-github-bot
Copy link
Contributor

@ezyang merged this pull request in fdfc676.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged module: autograd Related to torch.autograd, and the autograd engine in general module: nn Related to torch.nn module: pybind Related to our Python bindings / interactions with other Python libraries

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants