Conversation

@Varal7 Varal7 commented Jun 23, 2021

Stack from ghstack:

Summary: This PR is part of a stack aimed at addressing #58512.
Provide bindings for the SavedVariable class (for custom functions only). The SavedTensor object cannot be created from Python and currently does not expose any method.
Undefined behavior can arise if the user keeps a reference to this SavedTensor after the underlying SavedVariable has been freed.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: D29466228

facebook-github-bot commented Jun 23, 2021

💊 CI failures summary and remediations

As of commit a7ccff0 (more details on the Dr. CI page and at hud.pytorch.org/pr/60551):


  • 3/3 failures possibly* introduced in this PR
    • 1/3 non-scanned failure(s)

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (1/2)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Jun 29 20:29:46 ModuleNotFoundError: No module named 'expecttest'
Jun 29 20:29:45 + [[ -n '' ]]
Jun 29 20:29:45 + [[ pytorch-linux-xenial-py3.6-gcc5.4-build == *xla* ]]
Jun 29 20:29:45 + [[ pytorch-linux-xenial-py3.6-gcc5.4-build != *libtorch* ]]
Jun 29 20:29:45 + [[ pytorch-linux-xenial-py3.6-gcc5.4-build != *bazel* ]]
Jun 29 20:29:45 + python test/run_test.py --export-past-test-times
Jun 29 20:29:46 Traceback (most recent call last):
Jun 29 20:29:46   File "test/run_test.py", line 18, in <module>
Jun 29 20:29:46     from torch.testing._internal.common_utils import FILE_SCHEMA, IS_IN_CI, TEST_WITH_ROCM, shell, set_cwd
Jun 29 20:29:46   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 48, in <module>
Jun 29 20:29:46     import expecttest
Jun 29 20:29:46 ModuleNotFoundError: No module named 'expecttest'
Jun 29 20:29:47 + cleanup
Jun 29 20:29:47 + retcode=1
Jun 29 20:29:47 + set +x
Jun 29 20:29:47 =================== sccache compilation log ===================
Jun 29 20:29:47 =========== If your build fails, please take a look at the log above for possible reasons ===========
Jun 29 20:29:47 Compile requests                    5065
Jun 29 20:29:47 Compile requests executed           4711
Jun 29 20:29:47 Cache hits                          4691
Jun 29 20:29:47 Cache hits (C/C++)                  4691
Jun 29 20:29:47 Cache misses                           2

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (2/2)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Jun 29 20:47:05 ModuleNotFoundError: No module named 'expecttest'
Jun 29 20:47:04 + [[ pytorch-xla-linux-bionic-py3.6-clang9-build != *xla* ]]
Jun 29 20:47:04 ~/workspace
Jun 29 20:47:04 + [[ pytorch-xla-linux-bionic-py3.6-clang9-build != *libtorch* ]]
Jun 29 20:47:04 + [[ pytorch-xla-linux-bionic-py3.6-clang9-build != *bazel* ]]
Jun 29 20:47:04 + python test/run_test.py --export-past-test-times
Jun 29 20:47:05 Traceback (most recent call last):
Jun 29 20:47:05   File "test/run_test.py", line 18, in <module>
Jun 29 20:47:05     from torch.testing._internal.common_utils import FILE_SCHEMA, IS_IN_CI, TEST_WITH_ROCM, shell, set_cwd
Jun 29 20:47:05   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 48, in <module>
Jun 29 20:47:05     import expecttest
Jun 29 20:47:05 ModuleNotFoundError: No module named 'expecttest'
Jun 29 20:47:05 + cleanup
Jun 29 20:47:05 + retcode=1
Jun 29 20:47:05 + set +x
Jun 29 20:47:05 =================== sccache compilation log ===================
Jun 29 20:47:05 =========== If your build fails, please take a look at the log above for possible reasons ===========
Jun 29 20:47:05 Compile requests                      0
Jun 29 20:47:05 Compile requests executed             0
Jun 29 20:47:05 Cache hits                            0
Jun 29 20:47:05 Cache misses                          0
Jun 29 20:47:05 Cache timeouts                        0

Preview docs built from this PR

Varal7 added a commit that referenced this pull request Jun 23, 2021
ghstack-source-id: 366f1d3
Pull Request resolved: #60551

@soulitzer soulitzer left a comment

Looks OK. The main issue, I think, is how to deal with the case where backward has already freed the SavedVariable. Maybe there's some way to do this in pybind, but if we use the plain old PyObject struct, we can have it hold a weak_ptr and raise an error if the C++ side has already been freed.
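
A minimal sketch of that idea, purely illustrative (the struct and field names are hypothetical, not the PR's code, and it assumes the SavedVariable would be owned through a shared_ptr so the binding can observe it via a weak_ptr):

// Hypothetical sketch only, not the actual binding.
struct THPSavedTensor {
  PyObject_HEAD
  // Observes the C++ SavedVariable without keeping it alive.
  std::weak_ptr<torch::autograd::SavedVariable> saved_variable;
};

static PyObject* THPSavedTensor_register_hooks(PyObject* self, PyObject* args) {
  auto sv = reinterpret_cast<THPSavedTensor*>(self)->saved_variable.lock();
  if (!sv) {
    PyErr_SetString(PyExc_RuntimeError,
        "the underlying SavedVariable has already been freed");
    return nullptr;
  }
  // ... forward the Python callables to sv->register_hooks(...) ...
  Py_RETURN_NONE;
}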

Varal7 commented Jun 23, 2021

Looks OK. The main issue, I think, is how to deal with the case where backward has already freed the SavedVariable. Maybe there's some way to do this in pybind, but if we use the plain old PyObject struct, we can have it hold a weak_ptr and raise an error if the C++ side has already been freed.

Thanks for the review!
Yes, after #60565 the behavior is that accessing the raw SavedVariable after it has been freed raises an error (https://github.com/pytorch/pytorch/pull/60565/files#diff-1fdfd061a92957fd11edbaceb538f7186f185b9fb22b7219107b007f7df557fbR750).
The error could be a tiny bit more explicit: in this case we know the user is explicitly trying to access the saved variable rather than trying to backward, but the current error message already mentions that accessing the saved variable is a possible cause.

I can rebase to include it in this PR.
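
For reference, the check added in #60565 is roughly of this shape (a simplified, hedged sketch, not the exact diff; the field names and message wording here are illustrative):

// Simplified sketch of the guard in SavedVariable::unpack():
// if the saved data has already been released, raise a clear error
// instead of handing back an undefined tensor.
TORCH_CHECK(
    data_.defined() || was_default_constructed_,
    "Trying to use a saved tensor after it has already been freed. "
    "Saved intermediate values of the graph are freed when you call "
    ".backward() or autograd.grad().");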

@soulitzer
Contributor

Np!
Ahh yes, that error is probably good to include in this PR as well, for completeness.
The situation I was thinking about is slightly different: the user calls raw_saved_tensors and saves a reference to a SavedVariable, then calls backward.

@albanD albanD removed their request for review June 23, 2021 22:21

albanD commented Jun 23, 2021

Removing myself as @soulitzer is handling this one.
You can add me back if you have a specific question for me!

Varal7 commented Jun 24, 2021

The situation I was thinking about is slightly different: the user calls raw_saved_tensors and saves a reference to a SavedVariable, then calls backward.

Good point. My proposal is the following.
Because the only method we expose (and plan to expose) is SavedVariable::register_hooks, we can check there that the variable has not already been freed.
So if the user keeps a reference to the raw_saved_tensors after calling backward, they will just hold a useless object and will get a clear error message if they try to call register_hooks on it.
Like so: https://github.com/pytorch/pytorch/pull/60663/files#diff-cc9fba479b5beae06b2eea2e390d17796e0341c5b037a20b5bcaccbb0c341030R156
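
A rough sketch of what that check could look like inside SavedVariable::register_hooks (hedged; the exact signature, field name, and message are in the linked diff, not here):

// Illustrative sketch of the proposed guard.
void SavedVariable::register_hooks(/* pack/unpack hooks */) {
  TORCH_CHECK(
      data_.defined(),
      "Calling register_hook on a saved tensor after it has been freed.");
  // ... store the hooks and pack the saved data ...
}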

@Varal7 Varal7 requested a review from soulitzer June 24, 2021 17:43
@soulitzer
Contributor

That's a good plan, but aren't there also some cases where the entire SavedVariable is destroyed, not just its data field?
For example, all custom functions (we could add a test for this here) and some codegen'd functions store saved variables in a vector, and on release_variables the vector is cleared.

@Varal7 Varal7 mentioned this pull request Jun 24, 2021

Varal7 commented Jun 24, 2021

I do agree with you, and I've written tests covering those cases (here and here).

What is confusing to me is that the following code does trigger "RuntimeError: Calling register_hook on a saved tensor after it has been freed.", as we would want.

import torch
from torch.autograd import Function

class MyFn(Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x

    @staticmethod
    def backward(ctx, g):
        return g

y = MyFn.apply(torch.ones(5, requires_grad=True))
t = y.grad_fn.raw_saved_tensors
y.sum().backward()  # calls grad_fn->release_variables() and the destructors for each SavedVariable (but not `.reset()`)
t[0].register_hooks(lambda x: x, lambda x: x)

@soulitzer
Contributor

It could be because vector.clear just updates the size while the capacity stays the same, so the memory is still reserved by the vector.
Maybe we should call shrink_to_fit on the vector to actually release the memory (though technically even that is not guaranteed).
It could be interesting to see whether this improves memory usage too.
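
An illustrative sketch of that suggestion for a codegen'd node (the node and member names here are made up for the example):

// Illustrative sketch only; names are hypothetical.
void MyBackwardNode::release_variables() {
  // clear() destroys the SavedVariable elements but leaves the vector's
  // capacity untouched, so the underlying buffer stays allocated.
  saved_tensors_.clear();
  // The suggestion: also ask the vector to return the buffer to the
  // allocator (a non-binding request in the C++ standard).
  saved_tensors_.shrink_to_fit();
}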

Varal7 added a commit to Varal7/pytorch that referenced this pull request Jun 28, 2021
ghstack-source-id: 06d3801
Pull Request resolved: pytorch#60551

@soulitzer soulitzer left a comment

LGTM

@albanD albanD left a comment

Just a couple of small comments.

Varal7 commented Jun 29, 2021

@Varal7 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Varal7 commented Jun 29, 2021

@Varal7 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

This pull request has been merged in f54290f.
