-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Fix deadlock in some edge case in autograd #73961
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
CI Flow Status⚛️ CI FlowRuleset - Version:
|
🔗 Helpful links
💊 CI failures summary and remediationsAs of commit 5aa695e (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patternsThe following CI failures do not appear to be due to upstream breakages:
|
|
@albanD has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Spawn a fresh Python interpreter for the test. There are a few example of this e.g. test_cublas_config_nondeterministic_alert or |
|
We should audit the rest of the destructors to check that they don't unconditionally grab the GIL. It is basically never right to unconditionally get the GIL from a destructor. |
That was my idea in the original PR (at least for the autograd object's destructors). But this one is called directly by the custom deleter that we set for each shared_ptr we create that contains a Node. So I completely missed it. |
|
@albanD has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary:
Minimal example that deadlocks before but not after:
```python
import torch
from torch.autograd import Function
class Foo(Function):
staticmethod
def forward(ctx, x):
return x.clone()
staticmethod
def forward(ctx, gO):
return gO.clone()
def get_out():
inp = torch.rand(2, requires_grad=True)
# The python function is first so that it runs
# last in the backward pass
right = Foo.apply(inp)
# An op that creates new memory
left1 = inp.clone()
# An op that saves its input
left2 = left1 ** 2
# Inplace modify so that the backward for
# left2 always raises an error
left1 += 1
# An op that takes both side as input.
# After running, both side's last op will be in
# the ready queue
# And the op for left will run first as it was
# executed last during the forward
out = left2 + right
return out
# Nothing should be global variables here as, from what
# I can see, python leaks all the global objects
get_out().sum().backward()
```
Since this requires the python interpreter to die, it is hard to test in CI.
Let me know if you have an idea how to do it though.
Pull Request resolved: #73961
Reviewed By: malfet
Differential Revision: D34752747
Pulled By: albanD
fbshipit-source-id: 1a537b1f733e161e8d3ff053cd432b37b34d432a
Minimal example that deadlocks before but not after:
Since this requires the python interpreter to die, it is hard to test in CI.
Let me know if you have an idea how to do it though.