
Conversation

Varal7 (Contributor) commented Jun 29, 2021

Stack from ghstack:

Summary: Fixes #58512
Uses the hooks introduced in #60663: upon registration, the pack hook is called and the returned Python object is stored. From then on, whenever the saved tensor needs to be unpacked, that stored object is passed to the unpack hook.
Packing can be done with gradient tracking disabled, because the correct grad_fn is added back during unpacking.
In-place operations performed by the pack_hook on the original tensor (in the `leaf || !output` case) will be caught if the saved tensor is later used by another op.
In-place operations performed by the unpack_hook will unfortunately not be caught; a warning will be added to the docs in a follow-up PR.
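
For illustration, a minimal sketch of how these hooks could be exercised from Python. The `grad_fn._raw_saved_self.register_hooks(pack, unpack)` spelling follows the saved-tensor-hooks documentation that grew out of this stack and is an assumption here, not something this PR body specifies:

```python
import torch

def pack_hook(x):
    # Runs once, when register_hooks is called; whatever it returns is stored
    # in place of the saved tensor.
    print("packing", x.shape)
    return x

def unpack_hook(packed):
    # Runs when backward needs the saved tensor; must return an equivalent tensor.
    print("unpacking")
    return packed

x = torch.randn(5, requires_grad=True)
y = x.pow(2)                     # pow saves `x` for its backward
y.grad_fn._raw_saved_self.register_hooks(pack_hook, unpack_hook)  # pack_hook runs here
y.sum().backward()               # unpack_hook runs when the saved tensor is used
```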

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: D29466227

facebook-github-bot (Contributor) commented Jun 29, 2021

💊 CI failures summary and remediations

As of commit 366ad39 (more details on the Dr. CI page and at hud.pytorch.org/pr/60975):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_macos_10_13_py3_test (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Jul 17 00:10:37 test_remote_message_script_de...yUniqueId(created_on=0, local_id=0) to be created.
Jul 17 00:10:09 frame #12: std::__1::__function::__func<std::__1::__bind<torch::distributed::rpc::ProcessGroupAgent::enqueueRecv(torch::distributed::rpc::RecvWork)::$_6, torch::distributed::rpc::RecvWork>, std::__1::allocator<std::__1::__bind<torch::distributed::rpc::ProcessGroupAgent::enqueueRecv(torch::distributed::rpc::RecvWork)::$_6, torch::distributed::rpc::RecvWork> >, void ()>::operator()() + 42 (0x1171145ba in libtorch_cpu.dylib)
Jul 17 00:10:09 frame #13: c10::ThreadPool::main_loop(unsigned long) + 569 (0x10e7dd369 in libc10.dylib)
Jul 17 00:10:09 frame #14: void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, c10::ThreadPool::ThreadPool(int, int, std::__1::function<void ()>)::$_0> >(void*) + 67 (0x10e7dda13 in libc10.dylib)
Jul 17 00:10:09 frame #15: _pthread_start + 148 (0x7fff69698109 in libsystem_pthread.dylib)
Jul 17 00:10:09 frame #16: thread_start + 15 (0x7fff69693b8b in libsystem_pthread.dylib)
Jul 17 00:10:09 
Jul 17 00:10:09 ok (4.214s)
Jul 17 00:10:17   test_remote_message_dropped_pickle (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.438s)
Jul 17 00:10:26   test_remote_message_dropped_pickle_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.422s)
Jul 17 00:10:33   test_remote_message_script_delay_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (7.297s)
Jul 17 00:10:37   test_remote_message_script_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:555] Received error while processing request type 260: falseINTERNAL ASSERT FAILED at "../torch/csrc/distributed/rpc/rref_context.cpp":390, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Jul 17 00:10:37 Exception raised from getOwnerRRef at ../torch/csrc/distributed/rpc/rref_context.cpp:390 (most recent call first):
Jul 17 00:10:37 frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 98 (0x1103e86b2 in libc10.dylib)
Jul 17 00:10:37 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 106 (0x1103e6e2a in libc10.dylib)
Jul 17 00:10:37 frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 64 (0x1103e7060 in libc10.dylib)
Jul 17 00:10:37 frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 1711 (0x11960eabf in libtorch_cpu.dylib)
Jul 17 00:10:37 frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 86 (0x1195f9316 in libtorch_cpu.dylib)
Jul 17 00:10:37 frame #5: torch::distributed::rpc::RequestCallbackImpl::processScriptRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 376 (0x10f9e4b58 in libtorch_python.dylib)
Jul 17 00:10:37 frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 437 (0x1195f7f65 in libtorch_cpu.dylib)
Jul 17 00:10:37 frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 74 (0x10f9e58ca in libtorch_python.dylib)
Jul 17 00:10:37 frame #8: c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> > c10::ivalue::Future::thenAsync<torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const::$_1>(torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const::$_1, std::__1::shared_ptr<c10::Type>)::'lambda'(c10::ivalue::Future&)::operator()(c10::ivalue::Future&) + 223 (0x1195ffc2f in libtorch_cpu.dylib)


Varal7 added a commit that referenced this pull request Jun 29, 2021
Summary: Fixes #58512
ghstack-source-id: 2417a47
Pull Request resolved: #60975
Varal7 (Contributor, Author) commented Jun 29, 2021

@Varal7 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Varal7 marked this pull request as draft June 29, 2021 19:21
Varal7 (Contributor, Author) commented Jun 29, 2021

@Varal7 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Varal7 added a commit that referenced this pull request Jun 30, 2021
Summary: Fixes #58512
ghstack-source-id: 384fccb
Pull Request resolved: #60975
Varal7 (Contributor, Author) commented Jun 30, 2021

@Varal7 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Varal7 (Contributor, Author) commented Jun 30, 2021

@Varal7 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Varal7 added a commit that referenced this pull request Jun 30, 2021
Summary: Fixes #58512
ghstack-source-id: 2e6248d
Pull Request resolved: #60975
Varal7 added a commit that referenced this pull request Jul 1, 2021
Summary: Fixes #58512
ghstack-source-id: f3c33b6
Pull Request resolved: #60975
Varal7 added a commit that referenced this pull request Jul 23, 2021
Fixes #58659. This PR builds directly on top of #58512.

~~Creates a context manager `with torch.autograd.graph.saved_tensors_default_hooks(pack, unpack)` that can be used on the Python side.~~

Expose a pair of functions to Python users: `torch.autograd.graph.set_saved_tensors_default_hooks(pack, unpack)` and `torch.autograd.graph.reset_saved_tensors_default_hooks()`.
These functions control the hooks applied to saved tensors: all tensors *saved* in that context will be packed using the `pack` function, then unpacked accordingly when needed.

Currently, this works by simply calling `register_hooks` (cf #60975) directly at the end of the constructor of a SavedVariable. This could be optimized further by not performing the copy before registering default hooks, but this would require a small refactor. Edit: the refactor is done in #61927.

A current limitation is that if users create tensors in this context, they will not be able to register additional hooks on the saved tensor.

For instance, to implement something like #28997, one could define a `pack` function that saves the tensor to disk whenever it is too big and returns a filename; `unpack` then simply reads the file back and returns the tensor, e.g.:

```python
import os, tempfile, uuid
import torch

tmp_dir = tempfile.mkdtemp()

def pack(x):
    # Save the tensor to disk; only the filename is kept in the graph.
    # (The size check described above is omitted for brevity.)
    name = os.path.join(tmp_dir, str(uuid.uuid4()))
    torch.save(x, name)
    return name

def unpack(name):
    # Load the tensor back from disk when backward needs it.
    return torch.load(name)
```
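
As a hedged usage sketch (not part of the commit itself), the `pack`/`unpack` pair above can be wired into the default hooks named in this description; the toy tensor and operation are illustrative only:

```python
# `pack`, `unpack`, `tmp_dir`, and `torch` as defined above.
torch.autograd.graph.set_saved_tensors_default_hooks(pack, unpack)
x = torch.randn(5, requires_grad=True)
y = (x * x).sum()    # every tensor saved for backward is packed to disk via `pack`
torch.autograd.graph.reset_saved_tensors_default_hooks()

y.backward()         # `unpack` reloads the saved tensors when backward needs them
```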

Differential Revision: [D29792193](https://our.internmc.facebook.com/intern/diff/D29792193)

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request Jul 26, 2021
Summary:
Pull Request resolved: #61834

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D29792193

Pulled By: Varal7

fbshipit-source-id: 33e931230ef59faa3ec8b5d11ef7c05539bce77c
Varal7 added a commit to Varal7/pytorch that referenced this pull request Jul 28, 2021
Add section to the Autograd mechanics docs to describe the recently
exposed saved tensors (pytorch#52451), how to register packing / unpacking
hooks (pytorch#60975) and how to use default hooks (pytorch#61834)
Varal7 added a commit that referenced this pull request Aug 2, 2021
Summary:

Relanding previous PR: #61834

Original PR led to timeout error in: https://www.internalfb.com/mast/job/yuguo-release_canary_offline_training-inlinecvrp_a-canary_offline_train_28a7ecfc

Now passing: https://www.internalfb.com/mast/job/quach-release_canary_offline_training-inlinecvrp_a-canary_offline_train_9bb57e98

The difference with the new version is we don't need to acquire the GIL when calling `PyDefaultSavedVariableHooks::get_hooks`.

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request Aug 2, 2021
Summary:
Pull Request resolved: #62563

Test Plan: Imported from OSS

Reviewed By: iramazanli

Differential Revision: D30045405

Pulled By: Varal7

fbshipit-source-id: 7f6c07af3a56fe8835d5edcc815c15ea4fb4e332
Varal7 added a commit that referenced this pull request Aug 20, 2021
Add section to the Autograd mechanics docs to describe the recently
exposed saved tensors (#52451), how to register packing / unpacking
hooks (#60975) and how to use default hooks (#61834)

ghstack-source-id: 9d9aa29
Pull Request resolved: #63647
facebook-github-bot pushed a commit that referenced this pull request Aug 20, 2021
Summary:
Add section to the Autograd mechanics docs to describe the recently
exposed saved tensors (#52451), how to register packing / unpacking
hooks (#60975) and how to use default hooks (#61834)

Sister PR: #62361 (a link from autograd.rst to notes/autograd will be added in whichever PR lands second)

Pull Request resolved: #62362

Reviewed By: soulitzer

Differential Revision: D30453177

Pulled By: Varal7

fbshipit-source-id: f5759977b069ff0ef36a47b08856d297691a6caa