Skip to content

Conversation

@albanD
Copy link
Collaborator

@albanD albanD commented Mar 9, 2022

Minimal example that deadlocks before but not after:

import torch
from torch.autograd import Function

class Foo(Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def forward(ctx, gO):
        return gO.clone()

def get_out():
    inp = torch.rand(2, requires_grad=True)

    # The python function is first so that it runs
    # last in the backward pass
    right = Foo.apply(inp)

    # An op that creates new memory
    left1 = inp.clone()
    # An op that saves its input
    left2 = left1 ** 2

    # Inplace modify so that the backward for
    # left2 always raises an error
    left1 += 1

    # An op that takes both side as input.
    # After running, both side's last op will be in
    # the ready queue
    # And the op for left will run first as it was
    # executed last during the forward
    out = left2 + right

    return out

# Nothing should be global variables here as, from what
# I can see, python leaks all the global objects
get_out().sum().backward()

Since this requires the python interpreter to die, it is hard to test in CI.
Let me know if you have an idea how to do it though.

@albanD albanD requested a review from ezyang March 9, 2022 14:38
@albanD albanD requested a review from soulitzer as a code owner March 9, 2022 14:38
@pytorch-bot
Copy link

pytorch-bot bot commented Mar 9, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/albanD/pytorch/blob/98b241ec4aadecf4649c8bcbf9609f6e3c11a115/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
linux-binary-libtorch-cxx11-abi ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk ✅ triggered
linux-binary-libtorch-pre-cxx11 ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk ✅ triggered
linux-binary-manywheel ciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk ✅ triggered
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-bionic-rocm4.5-py3.7 ciflow/all, ciflow/default, ciflow/linux, ciflow/rocm, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4-mobile-lightweight-dispatch-build ciflow/all, ciflow/cpu, ciflow/default, ciflow/libtorch, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
macos-arm64-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-arm64-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
macos-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
windows-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
windows-binary-libtorch-debug ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk ✅ triggered
windows-binary-libtorch-release ciflow/all, ciflow/binaries, ciflow/binaries_libtorch, ciflow/default, ciflow/trunk ✅ triggered
windows-binary-wheel ciflow/all, ciflow/binaries, ciflow/binaries_wheel, ciflow/default, ciflow/trunk ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
pytorch-xla-linux-bionic-py3.7-clang8 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla 🚫 skipped

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Mar 9, 2022

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 5aa695e (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-xenial-py3.7-gcc5.4 / test (backwards_compat, 1, 1, linux.2xlarge) (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-03-09T15:37:08.0035110Z RuntimeError:
2022-03-09T15:37:07.3334290Z Author: PyTorch Team
2022-03-09T15:37:07.3334698Z Author-email: packages@pytorch.org
2022-03-09T15:37:07.3335080Z License: BSD-3
2022-03-09T15:37:07.3335490Z Location: /opt/conda/lib/python3.7/site-packages
2022-03-09T15:37:07.3335747Z Requires: typing-extensions
2022-03-09T15:37:07.3335940Z Required-by: 
2022-03-09T15:37:07.3666307Z + python check_forward_backward_compatibility.py --existing-schemas nightly_schemas.txt
2022-03-09T15:37:08.0033999Z Traceback (most recent call last):
2022-03-09T15:37:08.0034482Z   File "check_forward_backward_compatibility.py", line 308, in <module>
2022-03-09T15:37:08.0034930Z     s = parse_schema(line.strip())
2022-03-09T15:37:08.0035110Z RuntimeError: 
2022-03-09T15:37:08.0035367Z Unknown custom class type profiler._RecordFunction. Please ensure it is registered.:
2022-03-09T15:37:08.0036214Z profiler::_call_end_callbacks_on_jit_fut._RecordFunction(__torch__.torch.classes.profiler._RecordFunction x, Future(t) y) -> (Future(t))
2022-03-09T15:37:08.0036621Z                                                                                           ~~~~~~~~~~~~~~~ <--- HERE
2022-03-09T15:37:08.0036755Z 
2022-03-09T15:37:08.0808376Z + cleanup
2022-03-09T15:37:08.0809022Z + retcode=1
2022-03-09T15:37:08.0809201Z + set +x
2022-03-09T15:37:08.0846553Z ##[error]Process completed with exit code 1.
2022-03-09T15:37:08.0875511Z ##[group]Run # Ensure the working directory gets chowned back to the current user
2022-03-09T15:37:08.0875837Z �[36;1m# Ensure the working directory gets chowned back to the current user�[0m

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@facebook-github-bot
Copy link
Contributor

@albanD has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ezyang
Copy link
Contributor

ezyang commented Mar 9, 2022

Since this requires the python interpreter to die, it is hard to test in CI.
Let me know if you have an idea how to do it though.

Spawn a fresh Python interpreter for the test. There are a few example of this e.g. test_cublas_config_nondeterministic_alert or test/test_logging.py. Make sure timeouts are working correctly.

@ezyang
Copy link
Contributor

ezyang commented Mar 9, 2022

We should audit the rest of the destructors to check that they don't unconditionally grab the GIL. It is basically never right to unconditionally get the GIL from a destructor.

@albanD
Copy link
Collaborator Author

albanD commented Mar 9, 2022

We should audit the rest of the destructors to check that they don't unconditionally grab the GIL. It is basically never right to unconditionally get the GIL from a destructor.

That was my idea in the original PR (at least for the autograd object's destructors). But this one is called directly by the custom deleter that we set for each shared_ptr we create that contains a Node. So I completely missed it.

@facebook-github-bot
Copy link
Contributor

@albanD has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@atuljangra
Copy link
Contributor

Thanks @albanD and @ezyang for the quick turnaround with this deadlock!

facebook-github-bot pushed a commit that referenced this pull request Mar 9, 2022
Summary:
Minimal example that deadlocks before but not after:
```python
import torch
from torch.autograd import Function

class Foo(Function):
    staticmethod
    def forward(ctx, x):
        return x.clone()

    staticmethod
    def forward(ctx, gO):
        return gO.clone()

def get_out():
    inp = torch.rand(2, requires_grad=True)

    # The python function is first so that it runs
    # last in the backward pass
    right = Foo.apply(inp)

    # An op that creates new memory
    left1 = inp.clone()
    # An op that saves its input
    left2 = left1 ** 2

    # Inplace modify so that the backward for
    # left2 always raises an error
    left1 += 1

    # An op that takes both side as input.
    # After running, both side's last op will be in
    # the ready queue
    # And the op for left will run first as it was
    # executed last during the forward
    out = left2 + right

    return out

# Nothing should be global variables here as, from what
# I can see, python leaks all the global objects
get_out().sum().backward()

```

Since this requires the python interpreter to die, it is hard to test in CI.
Let me know if you have an idea how to do it though.

Pull Request resolved: #73961

Reviewed By: malfet

Differential Revision: D34752747

Pulled By: albanD

fbshipit-source-id: 1a537b1f733e161e8d3ff053cd432b37b34d432a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants