Conversation

@xionghuaidong

Fixes #45558

This assertion failure is caused by the incorrect implementation of aten::set_grad_enabled in torch/csrc/jit/runtime/register_special_ops.cpp. The current implementation is:

Operator(
    "aten::set_grad_enabled(bool val) -> ()",
    [](Stack* stack) {
      torch::GradMode::set_enabled(pop(stack).toBool());
      push(stack, IValue());
    },
    aliasAnalysisConservative()),

which pushes a None onto the evaluation stack after calling set_enabled. According to the signature, this behavior is incorrect: the signature says this function returns no value. I guess the original author may have been confused by the behavior of Python, which implicitly returns None when a function definition does not end with a return statement carrying an explicit value.
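
For comparison, the Python behavior in question can be demonstrated with a trivial function (`set_flag` is a made-up name, for illustration only):

```python
# A Python function whose body has no explicit "return <value>"
# implicitly returns None to its caller.
def set_flag(val):
    flag = val  # do some work, but never return anything explicitly

print(set_flag(True))  # → None
```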

If aten::set_grad_enabled pushes a None onto the evaluation stack, then each time it is called the stack accumulates an extra None. In our case, with torch.no_grad(): causes aten::set_grad_enabled to be called twice (once on entry and once on exit), so when the forward method finishes, the evaluation stack is [None, None, Tensor]. But the return statement of GraphFunction::operator() in torch/csrc/jit/api/function_impl.cpp is return stack.front();, which tries to extract a tensor out of a None and thus triggers the assertion failure.
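
To make the accumulation concrete, here is a minimal pure-Python sketch of the interpreter's evaluation stack (every name here is hypothetical; the real interpreter is the C++ code above):

```python
# Hypothetical pure-Python model of the JIT evaluation stack (a plain list).

def buggy_set_grad_enabled(stack):
    stack.pop()          # consume the bool argument
    # ... GradMode would be toggled here ...
    stack.append(None)   # BUG: pushes a None despite the "-> ()" schema

stack = []
stack.append(False); buggy_set_grad_enabled(stack)  # entering no_grad
stack.append(True);  buggy_set_grad_enabled(stack)  # exiting no_grad
stack.append("Tensor")                              # forward's actual result
print(stack)  # → [None, None, 'Tensor']; stack[0] (the front) is None, not the Tensor
```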

The fix is simple: remove the push from the implementation of aten::set_grad_enabled.

@facebook-github-bot facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Sep 30, 2020

dr-ci bot commented Sep 30, 2020

💊 CI failures summary and remediations

As of commit 4feb8fc (more details on the Dr. CI page):



❄️ 2 failures tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_macos_10_13_py3_build (1/2)

Step: "Update Homebrew" ❄️

fatal: Could not read from remote repository.
Receiving objects:  98% (175/178) Receiving objects:  99% (177/178) Receiving objects: 100% (178/178) Receiving objects: 100% (178/178), 63.90 KiB | 10.65 MiB/s, done. 
Resolving deltas:  96% (89/92) Resolving deltas:  97% (90/92) Resolving deltas: 100% (92/92) Resolving deltas: 100% (92/92), completed with 85 local objects. 
From ssh://github.com/Homebrew/homebrew-cask-versions 
 + 15f6b44...90ed6b8 master     -> origin/master  (forced update) 
+ git reset --hard origin/master 
HEAD is now at 90ed6b8 Update microsoft-edge-beta from 86.0.622.19 to 86.0.622.28 (#9686) 
+ for path in '$(find /usr/local/Homebrew -type d -name .git)' 
+ cd /usr/local/Homebrew/Library/Taps/homebrew/homebrew-core/.git/.. 
+ git fetch --depth=1 origin 
Connection to github.com closed by remote host.  
fatal: Could not read from remote repository. 
 
Please make sure you have the correct access rights 
and the repository exists. 

See CircleCI build pytorch_ios_11_2_1_x86_64_build (2/2)

Step: "Update Homebrew" ❄️

fatal: Could not read from remote repository.
(remainder of the log is identical to the pytorch_macos_10_13_py3_build failure above)

Extra GitHub checks: 1 failed



@mrshenli mrshenli added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Oct 2, 2020
@mrshenli mrshenli requested a review from suo October 2, 2020 00:07
@SplitInfinity SplitInfinity requested review from SplitInfinity and removed request for suo October 6, 2020 17:20
    "aten::set_grad_enabled(bool val) -> ()",
    [](Stack* stack) {
      torch::GradMode::set_enabled(pop(stack).toBool());
      push(stack, IValue());


@eellison I remember we talked about this during the no_grad PR:
(screenshot from the earlier no_grad PR discussion)

I agree with the PR author's assessment: what we thought was the return value of torch.set_grad_enabled(False) is actually just Python's implicit None for a function with no explicit return value. But I wanted to double-check.

Contributor

@eellison eellison Oct 6, 2020


Yeah, my suggestion was to push None onto the stack, but we should also have updated the schema to reflect that.

Contributor


No return value is also fine. I do want to double-check that the operator really has no return value when it is annotated with (), though. (Printing the IR graph should show us.)

Contributor


Confirmed

graph():
  %2 : None = prim::Constant() # test/elias.py:5:0
  %0 : bool = prim::Constant[value=1]() # test/elias.py:6:27
   = aten::set_grad_enabled(%0) # test/elias.py:6:4
  return (%2)

Contributor

@eellison eellison left a comment


LGTM! Thx for the PR! It would be nice if we could have caught this automatically, maybe with a debug build that asserts the stack is the expected size after each operator. This isn't the first time we've had an extra stack value...

Contributor

@facebook-github-bot facebook-github-bot left a comment


@SplitInfinity has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@xionghuaidong
Author

> LGTM! Thx for the PR! It would be nice if we could have caught this automatically, maybe with a debug build that asserts the stack is the expected size after each operator. This isn't the first time we've had an extra stack value...

Checking the stack size after each operator is a bit heavy, so a debug build is the appropriate place for it. We could also add an always-on check in GraphFunction::operator() in torch/csrc/jit/api/function_impl.cpp that throws an exception if the final stack size is not 1. That check alone would have caught this bug.
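
Such an always-on check could look roughly like this; a hedged Python sketch with made-up names (run_graph_function, double_it), not the actual C++ code:

```python
# Hypothetical sketch of a stack-size sanity check after graph execution.
def run_graph_function(ops, inputs):
    stack = list(inputs)
    for op in ops:
        op(stack)
    # A graph function must leave exactly one return value on the stack;
    # anything else means some operator left the stack unbalanced.
    if len(stack) != 1:
        raise RuntimeError(f"expected stack size 1, got {len(stack)}")
    return stack[0]  # corresponds to C++'s stack.front()

double_it = lambda stack: stack.append(stack.pop() * 2)  # well-behaved op
print(run_graph_function([double_it], [21]))  # → 42
```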

SplitInfinity pushed a commit that referenced this pull request Oct 8, 2020
…45559)

Summary:
Fixes #45558

This assertion failure is caused by the incorrect implementation of ``aten::set_grad_enabled`` in [torch/csrc/jit/runtime/register_special_ops.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/runtime/register_special_ops.cpp#L436). The current implementation is:

```cpp
Operator(
    "aten::set_grad_enabled(bool val) -> ()",
    [](Stack* stack) {
      torch::GradMode::set_enabled(pop(stack).toBool());
      push(stack, IValue());
    },
    aliasAnalysisConservative()),
```

which pushes a ``None`` onto the evaluation stack after calling ``set_enabled``. According to the signature, this behavior is incorrect: the signature says this function returns no value. I guess the original author may have been confused by the behavior of Python, which implicitly returns ``None`` when a function definition does not end with a return statement carrying an explicit value.

If ``aten::set_grad_enabled`` pushes a ``None`` onto the evaluation stack, then each time it is called the stack accumulates an extra ``None``. In our case, ``with torch.no_grad():`` causes ``aten::set_grad_enabled`` to be called twice (once on entry and once on exit), so when the ``forward`` method finishes, the evaluation stack is ``[None, None, Tensor]``. But the return statement of ``GraphFunction::operator()`` in [torch/csrc/jit/api/function_impl.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/api/function_impl.cpp#L51) is ``return stack.front();``, which tries to extract a tensor out of a ``None`` and thus triggers the assertion failure.

The fix is simple: remove the push from the implementation of ``aten::set_grad_enabled``.

Pull Request resolved: #45559

Reviewed By: albanD

Differential Revision: D24142153

Pulled By: SplitInfinity

fbshipit-source-id: 75aad0e38bd912a437f7e1a1ee89ab4445e35b5d
@facebook-github-bot
Contributor

@SplitInfinity merged this pull request in e3112e3.

malfet pushed a commit that referenced this pull request Oct 12, 2020
…45559) (#46060)

(commit message identical to the one above)

Co-authored-by: huaidong.xiong <huaidong.xiong@mobvista.com>

Closes #45558: [JIT] TorchScript runtime Expected Tensor but got None