
Conversation

Varal7 (Contributor) commented Jun 29, 2021

Stack from ghstack:

Summary: Fixes #58512
Uses the hooks introduced in #60663: upon registration, the pack hook is called and the returned Python object is stored. From then on, whenever the saved tensor needs to be unpacked, that stored object is passed to the unpack hook.
Packing can be done with gradient tracking disabled, because the correct grad_fn is added back during unpacking.
In-place operations performed by the pack_hook on the original tensor (in the `leaf || !output` case) will be caught if the saved tensor is later used by another op.
In-place operations performed by the unpack_hook will unfortunately not be caught; a warning will be added to the docs in a follow-up PR.
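
For illustration, a minimal sketch of how these hooks could be exercised from Python. The `grad_fn._raw_saved_self.register_hooks(pack, unpack)` spelling follows the saved-tensor-hooks documentation that grew out of this stack and is an assumption here, not something this PR body specifies:

```python
import torch

def pack_hook(x):
    # Runs once, when register_hooks is called; whatever it returns is stored
    # in place of the saved tensor.
    print("packing", x.shape)
    return x

def unpack_hook(packed):
    # Runs when backward needs the saved tensor; must return an equivalent tensor.
    print("unpacking")
    return packed

x = torch.randn(5, requires_grad=True)
y = x.pow(2)                     # pow saves `x` for its backward
y.grad_fn._raw_saved_self.register_hooks(pack_hook, unpack_hook)  # pack_hook runs here
y.sum().backward()               # unpack_hook runs when the saved tensor is used
```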

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: D29466227

facebook-github-bot (Contributor) commented Jun 29, 2021

💊 CI failures summary and remediations

As of commit 366ad39 (more details on the Dr. CI page and at hud.pytorch.org/pr/60975):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_macos_10_13_py3_test (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Jul 17 00:10:37 test_remote_message_script_de...yUniqueId(created_on=0, local_id=0) to be created.
Jul 17 00:10:09 frame #12: std::__1::__function::__func<std::__1::__bind<torch::distributed::rpc::ProcessGroupAgent::enqueueRecv(torch::distributed::rpc::RecvWork)::$_6, torch::distributed::rpc::RecvWork>, std::__1::allocator<std::__1::__bind<torch::distributed::rpc::ProcessGroupAgent::enqueueRecv(torch::distributed::rpc::RecvWork)::$_6, torch::distributed::rpc::RecvWork> >, void ()>::operator()() + 42 (0x1171145ba in libtorch_cpu.dylib)
Jul 17 00:10:09 frame #13: c10::ThreadPool::main_loop(unsigned long) + 569 (0x10e7dd369 in libc10.dylib)
Jul 17 00:10:09 frame #14: void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, c10::ThreadPool::ThreadPool(int, int, std::__1::function<void ()>)::$_0> >(void*) + 67 (0x10e7dda13 in libc10.dylib)
Jul 17 00:10:09 frame #15: _pthread_start + 148 (0x7fff69698109 in libsystem_pthread.dylib)
Jul 17 00:10:09 frame #16: thread_start + 15 (0x7fff69693b8b in libsystem_pthread.dylib)
Jul 17 00:10:09 
Jul 17 00:10:09 ok (4.214s)
Jul 17 00:10:17   test_remote_message_dropped_pickle (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.438s)
Jul 17 00:10:26   test_remote_message_dropped_pickle_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.422s)
Jul 17 00:10:33   test_remote_message_script_delay_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (7.297s)
Jul 17 00:10:37   test_remote_message_script_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:555] Received error while processing request type 260: falseINTERNAL ASSERT FAILED at "../torch/csrc/distributed/rpc/rref_context.cpp":390, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Jul 17 00:10:37 Exception raised from getOwnerRRef at ../torch/csrc/distributed/rpc/rref_context.cpp:390 (most recent call first):
Jul 17 00:10:37 frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 98 (0x1103e86b2 in libc10.dylib)
Jul 17 00:10:37 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 106 (0x1103e6e2a in libc10.dylib)
Jul 17 00:10:37 frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 64 (0x1103e7060 in libc10.dylib)
Jul 17 00:10:37 frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 1711 (0x11960eabf in libtorch_cpu.dylib)
Jul 17 00:10:37 frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 86 (0x1195f9316 in libtorch_cpu.dylib)
Jul 17 00:10:37 frame #5: torch::distributed::rpc::RequestCallbackImpl::processScriptRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 376 (0x10f9e4b58 in libtorch_python.dylib)
Jul 17 00:10:37 frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 437 (0x1195f7f65 in libtorch_cpu.dylib)
Jul 17 00:10:37 frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 74 (0x10f9e58ca in libtorch_python.dylib)
Jul 17 00:10:37 frame #8: c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> > c10::ivalue::Future::thenAsync<torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const::$_1>(torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const::$_1, std::__1::shared_ptr<c10::Type>)::'lambda'(c10::ivalue::Future&)::operator()(c10::ivalue::Future&) + 223 (0x1195ffc2f in libtorch_cpu.dylib)


Varal7 added a commit that referenced this pull request Jun 29, 2021
Summary: Fixes #58512
ghstack-source-id: 2417a47
Pull Request resolved: #60975
Varal7 (Contributor, Author) commented Jun 29, 2021

@Varal7 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Varal7 marked this pull request as draft June 29, 2021 19:21
Varal7 (Contributor, Author) commented Jun 29, 2021

@Varal7 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Varal7 added a commit that referenced this pull request Jun 30, 2021
Summary: Fixes #58512
ghstack-source-id: 384fccb
Pull Request resolved: #60975
Varal7 (Contributor, Author) commented Jun 30, 2021

@Varal7 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Varal7 (Contributor, Author) commented Jun 30, 2021

@Varal7 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Varal7 added a commit that referenced this pull request Jun 30, 2021
Summary: Fixes #58512
ghstack-source-id: 2e6248d
Pull Request resolved: #60975
Varal7 added a commit that referenced this pull request Jul 1, 2021
Summary: Fixes #58512
ghstack-source-id: f3c33b6
Pull Request resolved: #60975
Varal7 added a commit that referenced this pull request Jul 23, 2021
Fixes #58659. This PR builds directly on top of #58512.

~~Creates a context manager `with torch.autograd.graph.saved_tensors_default_hooks(pack, unpack)` that can be used on the Python side.~~

Expose a pair of functions to Python users: `torch.autograd.graph.set_saved_tensors_default_hooks(pack, unpack)` and `torch.autograd.graph.reset_saved_tensors_default_hooks()`.
These functions control the hooks applied to saved tensors: all tensors *saved* in that context will be packed using the `pack` function, then unpacked accordingly when needed.

Currently, this works by simply calling `register_hooks` (cf #60975) directly at the end of the constructor of a SavedVariable. This could be optimized further by not performing the copy before registering default hooks, but this would require a small refactor. Edit: the refactor is done in #61927.

A current limitation is that if users create tensors in this context, they will not be able to register additional hooks on the saved tensor.

For instance, to implement something like #28997, one could define a `pack` function that saves the tensor to disk whenever it is too big and returns a filename; `unpack` then simply reads the file back and returns the tensor, e.g.:

```python
import os, tempfile, uuid
import torch

tmp_dir = tempfile.mkdtemp()

def pack(x):
    # Save the tensor to disk; only the filename is kept in the graph.
    # (The size check described above is omitted for brevity.)
    name = os.path.join(tmp_dir, str(uuid.uuid4()))
    torch.save(x, name)
    return name

def unpack(name):
    # Load the tensor back from disk when backward needs it.
    return torch.load(name)
```
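
As a hedged usage sketch (not part of the commit itself), the `pack`/`unpack` pair above can be wired into the default hooks named in this description; the toy tensor and operation are illustrative only:

```python
# `pack`, `unpack`, `tmp_dir`, and `torch` as defined above.
torch.autograd.graph.set_saved_tensors_default_hooks(pack, unpack)
x = torch.randn(5, requires_grad=True)
y = (x * x).sum()    # every tensor saved for backward is packed to disk via `pack`
torch.autograd.graph.reset_saved_tensors_default_hooks()

y.backward()         # `unpack` reloads the saved tensors when backward needs them
```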

Differential Revision: [D29792193](https://our.internmc.facebook.com/intern/diff/D29792193)

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request Jul 26, 2021
Summary:
Pull Request resolved: #61834

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D29792193

Pulled By: Varal7

fbshipit-source-id: 33e931230ef59faa3ec8b5d11ef7c05539bce77c
Varal7 added a commit to Varal7/pytorch that referenced this pull request Jul 28, 2021
Add section to the Autograd mechanics docs to describe the recently
exposed saved tensors (pytorch#52451), how to register packing / unpacking
hooks (pytorch#60975) and how to use default hooks (pytorch#61834)
Varal7 added a commit that referenced this pull request Aug 2, 2021
Summary:

Relanding previous PR: #61834

Original PR led to timeout error in: https://www.internalfb.com/mast/job/yuguo-release_canary_offline_training-inlinecvrp_a-canary_offline_train_28a7ecfc

Now passing: https://www.internalfb.com/mast/job/quach-release_canary_offline_training-inlinecvrp_a-canary_offline_train_9bb57e98

The difference with the new version is we don't need to acquire the GIL when calling `PyDefaultSavedVariableHooks::get_hooks`.

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request Aug 2, 2021
Summary:
Pull Request resolved: #62563

Test Plan: Imported from OSS

Reviewed By: iramazanli

Differential Revision: D30045405

Pulled By: Varal7

fbshipit-source-id: 7f6c07af3a56fe8835d5edcc815c15ea4fb4e332
Varal7 added a commit that referenced this pull request Aug 20, 2021
Add section to the Autograd mechanics docs to describe the recently
exposed saved tensors (#52451), how to register packing / unpacking
hooks (#60975) and how to use default hooks (#61834)

ghstack-source-id: 9d9aa29
Pull Request resolved: #63647
facebook-github-bot pushed a commit that referenced this pull request Aug 20, 2021
Summary:
Add section to the Autograd mechanics docs to describe the recently
exposed saved tensors (#52451), how to register packing / unpacking
hooks (#60975) and how to use default hooks (#61834)

Sister PR: #62361 (a link from autograd.rst to notes/autograd will be added in whichever PR lands second)

Pull Request resolved: #62362

Reviewed By: soulitzer

Differential Revision: D30453177

Pulled By: Varal7

fbshipit-source-id: f5759977b069ff0ef36a47b08856d297691a6caa