Fix deadlock in some edge case in autograd #73961

albanD · 2022-03-09T14:38:47Z

Minimal example that deadlocks before this change but not after:

import torch
from torch.autograd import Function

class Foo(Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, gO):
        return gO.clone()

def get_out():
    inp = torch.rand(2, requires_grad=True)

    # The python function is first so that it runs
    # last in the backward pass
    right = Foo.apply(inp)

    # An op that creates new memory
    left1 = inp.clone()
    # An op that saves its input
    left2 = left1 ** 2

    # Inplace modify so that the backward for
    # left2 always raises an error
    left1 += 1

    # An op that takes both sides as input.
    # After running, both sides' last ops will be in
    # the ready queue,
    # and the op for left will run first as it was
    # executed last during the forward
    out = left2 + right

    return out

# Nothing should be global variables here as, from what
# I can see, python leaks all the global objects
get_out().sum().backward()

Since this requires the python interpreter to die, it is hard to test in CI.
Let me know if you have an idea how to do it though.

pytorch-bot · 2022-03-09T14:38:51Z

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/albanD/pytorch/blob/98b241ec4aadecf4649c8bcbf9609f6e3c11a115/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

Workflows	Labels (bold enabled)	Status
Triggered Workflows
linux-binary-conda	`ciflow/binaries`, `ciflow/binaries_conda`, `ciflow/default`	✅ triggered
linux-binary-libtorch-cxx11-abi	`ciflow/all`, `ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`, `ciflow/trunk`	✅ triggered
linux-binary-libtorch-pre-cxx11	`ciflow/all`, `ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`, `ciflow/trunk`	✅ triggered
linux-binary-manywheel	`ciflow/all`, `ciflow/binaries`, `ciflow/binaries_wheel`, `ciflow/default`, `ciflow/trunk`	✅ triggered
linux-bionic-py3.7-clang9	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/noarch`, `ciflow/trunk`	✅ triggered
linux-bionic-rocm4.5-py3.7	`ciflow/all`, `ciflow/default`, `ciflow/linux`, `ciflow/rocm`, `ciflow/trunk`	✅ triggered
linux-docs	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/docs`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-vulkan-bionic-py3.7-clang9	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`, `ciflow/vulkan`	✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test	`ciflow/all`, `ciflow/bazel`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-xenial-py3-clang5-mobile-build	`ciflow/all`, `ciflow/default`, `ciflow/linux`, `ciflow/mobile`, `ciflow/trunk`	✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static	`ciflow/all`, `ciflow/default`, `ciflow/linux`, `ciflow/mobile`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-clang7-asan	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/sanitizers`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-clang7-onnx	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/onnx`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-gcc5.4	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/mobile`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-gcc7	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-gcc7-no-ops	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
macos-arm64-binary-conda	`ciflow/binaries`, `ciflow/binaries_conda`, `ciflow/default`	✅ triggered
macos-arm64-binary-wheel	`ciflow/binaries`, `ciflow/binaries_wheel`, `ciflow/default`	✅ triggered
macos-binary-conda	`ciflow/binaries`, `ciflow/binaries_conda`, `ciflow/default`	✅ triggered
macos-binary-libtorch-cxx11-abi	`ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`	✅ triggered
macos-binary-libtorch-pre-cxx11	`ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`	✅ triggered
macos-binary-wheel	`ciflow/binaries`, `ciflow/binaries_wheel`, `ciflow/default`	✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single	`ciflow/all`, `ciflow/android`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit	`ciflow/all`, `ciflow/android`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
win-vs2019-cpu-py3	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/trunk`, `ciflow/win`	✅ triggered
win-vs2019-cuda11.3-py3	`ciflow/all`, `ciflow/cuda`, `ciflow/default`, `ciflow/trunk`, `ciflow/win`	✅ triggered
windows-binary-conda	`ciflow/binaries`, `ciflow/binaries_conda`, `ciflow/default`	✅ triggered
windows-binary-libtorch-debug	`ciflow/all`, `ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`, `ciflow/trunk`	✅ triggered
windows-binary-libtorch-release	`ciflow/all`, `ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`, `ciflow/trunk`	✅ triggered
windows-binary-wheel	`ciflow/all`, `ciflow/binaries`, `ciflow/binaries_wheel`, `ciflow/default`, `ciflow/trunk`	✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
docker-builds	`ciflow/all`, `ciflow/trunk`	🚫 skipped
ios-12-5-1-arm64	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/scheduled`	🚫 skipped
ios-12-5-1-arm64-coreml	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/scheduled`	🚫 skipped
ios-12-5-1-arm64-custom-ops	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/scheduled`	🚫 skipped
ios-12-5-1-arm64-metal	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/scheduled`	🚫 skipped
ios-12-5-1-x86-64	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
ios-12-5-1-x86-64-coreml	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/slow`, `ciflow/trunk`	🚫 skipped
linux-docs-push	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
macos-10-15-py3-arm64	`ciflow/all`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64	`ciflow/all`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
macos-11-py3-x86-64	`ciflow/all`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/scheduled`, `ciflow/slow`, `ciflow/slow-gradcheck`	🚫 skipped
periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
periodic-win-vs2019-cuda11.5-py3	`ciflow/all`, `ciflow/cuda`, `ciflow/scheduled`, `ciflow/win`	🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build	`ciflow/all`, `ciflow/android`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
pytorch-xla-linux-bionic-py3.7-clang8	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk`, `ciflow/xla`	🚫 skipped

facebook-github-bot · 2022-03-09T14:38:53Z

🔗 Helpful links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/73961
📄 Preview docs built from this PR
📄 Preview C++ docs built from this PR
↩️ [fb-only] Re-run with SSH instructions
🔧 Opt-in to CIFlow to control what jobs run on your PRs

💊 CI failures summary and remediations

As of commit 5aa695e (more details on the Dr. CI page):

1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge) (1/1)

Step: "Test"

2022-03-09T15:37:08.0035110Z RuntimeError:

2022-03-09T15:37:07.3334290Z Author: PyTorch Team
2022-03-09T15:37:07.3334698Z Author-email: packages@pytorch.org
2022-03-09T15:37:07.3335080Z License: BSD-3
2022-03-09T15:37:07.3335490Z Location: /opt/conda/lib/python3.7/site-packages
2022-03-09T15:37:07.3335747Z Requires: typing-extensions
2022-03-09T15:37:07.3335940Z Required-by: 
2022-03-09T15:37:07.3666307Z + python check_forward_backward_compatibility.py --existing-schemas nightly_schemas.txt
2022-03-09T15:37:08.0033999Z Traceback (most recent call last):
2022-03-09T15:37:08.0034482Z   File "check_forward_backward_compatibility.py", line 308, in <module>
2022-03-09T15:37:08.0034930Z     s = parse_schema(line.strip())
2022-03-09T15:37:08.0035110Z RuntimeError: 
2022-03-09T15:37:08.0035367Z Unknown custom class type profiler._RecordFunction. Please ensure it is registered.:
2022-03-09T15:37:08.0036214Z profiler::_call_end_callbacks_on_jit_fut._RecordFunction(__torch__.torch.classes.profiler._RecordFunction x, Future(t) y) -> (Future(t))
2022-03-09T15:37:08.0036621Z                                                                                           ~~~~~~~~~~~~~~~ <--- HERE
2022-03-09T15:37:08.0036755Z 
2022-03-09T15:37:08.0808376Z + cleanup
2022-03-09T15:37:08.0809022Z + retcode=1
2022-03-09T15:37:08.0809201Z + set +x
2022-03-09T15:37:08.0846553Z ##[error]Process completed with exit code 1.
2022-03-09T15:37:08.0875511Z ##[group]Run # Ensure the working directory gets chowned back to the current user

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

facebook-github-bot · 2022-03-09T14:39:49Z

@albanD has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

torch/csrc/autograd/python_function.cpp

ezyang · 2022-03-09T14:44:03Z

> Since this requires the python interpreter to die, it is hard to test in CI.
> Let me know if you have an idea how to do it though.

Spawn a fresh Python interpreter for the test. There are a few examples of this, e.g. test_cublas_config_nondeterministic_alert or test/test_logging.py. Make sure timeouts are working correctly.
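
One way to read this suggestion, as a minimal sketch: run the candidate script in a child interpreter via `subprocess.run` with a `timeout`, so that a deadlock shows up as a test failure instead of a hung CI job. The helper name `run_in_fresh_interpreter` and the 10-second default are assumptions for illustration, not necessarily what the test added in this PR uses.

```python
import subprocess
import sys


def run_in_fresh_interpreter(script, timeout=10.0):
    """Run `script` in a new Python process (hypothetical helper).

    Returns (returncode, timed_out). A hang in the child, for example a
    deadlock during interpreter shutdown, is reported as timed_out=True.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before re-raising.
        return None, True
    return proc.returncode, False
```

In a test built on this sketch, `script` would be the repro from the PR description. Because that repro is written so that the backward for `left2` raises an error, such a test would likely assert that the child terminates (`timed_out` is false) rather than that it exits with code 0.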

ezyang · 2022-03-09T14:45:20Z

We should audit the rest of the destructors to check that they don't unconditionally grab the GIL. It is basically never right to unconditionally get the GIL from a destructor.

albanD · 2022-03-09T14:47:59Z

> We should audit the rest of the destructors to check that they don't unconditionally grab the GIL. It is basically never right to unconditionally get the GIL from a destructor.

That was my idea in the original PR (at least for the autograd object's destructors). But this one is called directly by the custom deleter that we set for each shared_ptr we create that contains a Node. So I completely missed it.
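
The rule discussed here concerns C++ destructors, and the actual change lives in torch/csrc/autograd/python_function.cpp, which is not shown in this thread. As a loose pure-Python analogy only, the idea of making cleanup conditional on interpreter state can be sketched with `sys.is_finalizing()`. The `guarded_cleanup` helper and its injectable `is_finalizing` parameter are hypothetical names introduced for illustration.

```python
import sys


def guarded_cleanup(action, is_finalizing=sys.is_finalizing):
    """Run `action` unless the interpreter is shutting down.

    Returns True if the cleanup ran, False if it was skipped. Skipping
    deliberately leaves the resource for process exit to reclaim, which
    is preferable to blocking forever during shutdown.
    """
    if is_finalizing():
        return False
    action()
    return True
```

The `is_finalizing` parameter exists only so the shutdown branch can be exercised in a test without actually tearing down the interpreter.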

facebook-github-bot · 2022-03-09T15:51:27Z

@albanD has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

atuljangra · 2022-03-09T17:28:36Z

Thanks @albanD and @ezyang for the quick turnaround with this deadlock!

Summary:
Minimal example that deadlocks before but not after:
```python
import torch
from torch.autograd import Function

class Foo(Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, gO):
        return gO.clone()

def get_out():
    inp = torch.rand(2, requires_grad=True)

    # The python function is first so that it runs
    # last in the backward pass
    right = Foo.apply(inp)

    # An op that creates new memory
    left1 = inp.clone()
    # An op that saves its input
    left2 = left1 ** 2

    # Inplace modify so that the backward for
    # left2 always raises an error
    left1 += 1

    # An op that takes both sides as input.
    # After running, both sides' last ops will be in
    # the ready queue,
    # and the op for left will run first as it was
    # executed last during the forward
    out = left2 + right

    return out

# Nothing should be global variables here as, from what
# I can see, python leaks all the global objects
get_out().sum().backward()
```

Since this requires the python interpreter to die, it is hard to test in CI.
Let me know if you have an idea how to do it though.

Pull Request resolved: #73961

Reviewed By: malfet

Differential Revision: D34752747

Pulled By: albanD

fbshipit-source-id: 1a537b1f733e161e8d3ff053cd432b37b34d432a

Fix deadlock in some edge case in autograd

98b241e

albanD requested a review from ezyang March 9, 2022 14:38

albanD requested a review from soulitzer as a code owner March 9, 2022 14:38

pytorch-bot bot added the ciflow/default label Mar 9, 2022

facebook-github-bot added the cla signed label Mar 9, 2022

ezyang reviewed Mar 9, 2022

torch/csrc/autograd/python_function.cpp (outdated, resolved)

typo

d10bcac

ezyang approved these changes Mar 9, 2022

albanD added 2 commits March 9, 2022 10:11

Add test. Tested locally that the timeout does work

ccce6b4

flake8

5aa695e

pytorchmergebot closed this in b2a5507 Mar 9, 2022

albanD mentioned this pull request Mar 9, 2022

Release note bot is not sending messages on PRs anymore #73985

Open

WBobby mentioned this pull request Aug 17, 2022

Add ROCm5.2.3/AMDGPU support for PyTorch WBobby/pytorch#2

Closed

albanD mentioned this pull request Nov 2, 2022

MAINT: Ensure raw dlpack deleter works when called without the GIL numpy/numpy#22510

Merged

soulitzer mentioned this pull request Feb 21, 2024

DISABLED test_pynode_destruction_deadlock (__main__.TestAutograd) #119886

Closed
