Improve hooks ordering behavior #85849

Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/85849
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 Failure. As of commit 07b732c: FLAKY - the following jobs failed but were likely due to flakiness present on master.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Notable points:

- With this PR, PyFunctionTensorPreHooks registered to a leaf tensor will not change the input you receive in the post hook, but I think that is just the status quo.
- We'd like pre hooks (but not post hooks) to be able to modify the gradients before they are captured (this replicates what happens when accumulate_grad=true).
- Hooks registered to tensors (PyFunctionTensorPreHooks, as opposed to PyFunction(Pre|Post)Hooks) are not registered to the accumulate grad node. They are saved on the tensor's autograd_meta and accessed exclusively by AccumulateGrad::apply. I'm not actually sure why this is the case yet. Maybe we should fix it?

Alternative design:

- Actually set the needed_ of the outputs to true even when accumulate_grad=false, but also modify AccumulateGrad::apply to behave as a no-op when this happens. This seemed easier at first, but I ended up not doing it because I wanted to avoid the TLS accesses.

[ghstack-poisoned]
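For illustration, here is a minimal Python sketch (not code from the PR; exact semantics depend on the PyTorch version) of a tensor pre-hook rewriting the gradient before it is accumulated into .grad, which is the accumulate_grad=true path the comment refers to:

```python
import torch

w = torch.ones(3, requires_grad=True)
# Tensor pre-hook: runs before AccumulateGrad writes into w.grad and may
# return a replacement gradient.
w.register_hook(lambda g: 2 * g)

loss = (3 * w).sum()
loss.backward()
print(w.grad)  # tensor([6., 6., 6.]) -- 3 from the op, doubled by the hook
```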
albanD
left a comment
Sounds good.
This is BC-breaking though?
new_grad = (*hook)({new_grad})[0];
}
return new_grad;
std::vector<std::unique_ptr<FunctionPreHook>>& tensor_pre_hooks() noexcept
Shouldn't we have the same for retain_grad hook? Or is it assumed that leaf Tensors can't have retain_grad hooks?
Yes, I think it is ok to assume that.
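A small Python sketch of that assumption (illustrative only, not from this PR): retain_grad() on a leaf tensor is effectively a no-op, since leaves already get .grad populated through AccumulateGrad, whereas non-leaf tensors need it.

```python
import torch

leaf = torch.randn(2, requires_grad=True)
leaf.retain_grad()      # effectively a no-op: leaves already populate .grad
mid = leaf * 2
mid.retain_grad()       # needed: non-leaf gradients are not kept by default

mid.sum().backward()
print(leaf.grad)        # populated via AccumulateGrad as usual
print(mid.grad)         # populated only because of retain_grad()
```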
// Returns the inputs as a list of variables. Destroys given InputBuffer.
static std::vector<Variable> variables(InputBuffer&& g);

private:
Why?
Let me see if I can avoid this.
I don't think we can easily avoid this for now. It should be gone for sure though when we eventually remove the usage of capture hooks, so it should just be a temporary hack.
var->mutable_grad() = var->grad() + grad;
}
}
return at::TensorBase{};
What is the contract on the returned value here?
If the return value is defined, then it will modify the gradient; if it is undefined, the hook is treated as only "reading" the value of the gradient.
See:
pytorch/torch/csrc/autograd/cpp_hook.cpp
Lines 38 to 40 in 645fb21
if (!res.defined()) {
  // Don't change gradient
  continue;
// we will additionally register a single hook to the grad_fn.
//
// Note that the cpp and python use cases aren't actually aware of
// each other, so using both is not defined behavior.
What would happen in practice before this PR? They will just run in arbitrary order?
What about after this PR?
From the Python side, the hooks_ field is cleared every time the backward_hooks field is set, and a PyFunctionHook is registered.
From the cpp side, the hooks_ field is set whenever a cpp hook is registered AND cpp_hooks_list_ is empty. (That is also the case when we register a retains_grad hook, but it shouldn't matter because hooks_ is only used on leaf nodes and retains_grad is a no-op when invoked on a leaf node.)
What happens is that one will be overwritten by the other.
This is still the case after this PR, but I'm okay with not caring about the mixed Python + cpp case.
Ho wow this is very badly broken :O good to know!
Addresses: #35802

Design doc: https://docs.google.com/document/d/19xSib7FFknRQ5f3ptGFUmiOt3BrgXSUlTQH2xMcZJYg/edit#

### Changes in this PR

#### Implementation
- We now have 3 fields: pre_hooks, retains_grad_hooks, and tensor_pre_hooks, so that we can more precisely define their ordering and when they are executed.
- Since retains grad uses an entirely new field, we cannot reuse the old retains grad logic. We refactor retains grad to call directly into the variable.cpp logic. Other logic in variable.cpp that handles cpp hooks must also be updated.

#### Hooks ordering and execution
- Defines pre-hooks registered on a tensor to run before pre-hooks registered on grad_fn.
- Updates pre-hooks registered on a tensor to always run, even if the tensor is passed as inputs= to .grad().
- Post hooks (and pre hooks) can now observe the modifications to the gradient made by the tensor pre-hook.

#### Retains grad hooks
- retains_grad hooks always execute last, even if there are other tensor pre-hooks registered.

#### Unchanged
- pre_hooks registered to grad_fn aren't expected to execute if the corresponding tensor is passed as inputs= to .grad().

Follow ups:
- simplify the retains_grad field to not be a vector, since it always holds a single hook
- potentially merge capture hooks with tensor pre-hooks; this would involve some additional refactoring
- the behavior of Python hooks registered to a tensor on in-place ops is still wrong

[ghstack-poisoned]
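As a hedged illustration of the new behavior for inputs= to .grad() described above (a sketch, not a test from this PR; exact results depend on the PyTorch version this lands in):

```python
import torch

x = torch.ones(2, requires_grad=True)
y = 4 * x

calls = []

def tensor_pre_hook(grad):
    calls.append("tensor pre-hook on y")
    return grad * 2          # the modification should be visible in the capture

y.register_hook(tensor_pre_hook)

# With this PR, the tensor pre-hook runs even though y is passed as inputs=,
# and the captured gradient reflects its modification.
(gy,) = torch.autograd.grad(y.sum(), inputs=(y,))
print(calls)  # ["tensor pre-hook on y"]
print(gy)     # expected: tensor([2., 2.]) -- 1 from sum(), doubled by the hook
```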
albanD
left a comment
lint needs fixing but sounds good otherwise!
@pytorchbot merge -f "Unrelated failures"

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
return pre_hooks_;
}

virtual std::vector<std::unique_ptr<FunctionPreHook>>&
Could this virtual be an error? How is tensor_pre_hooks_ different from retains_grad_hooks_?
> How is tensor_pre_hooks_ different from retains_grad_hooks_?

There's a class that inherits from Node that overrides tensor_pre_hooks_.
@pytorchbot revert -h

❌ 🤖 pytorchbot command failed: Try

@pytorchbot revert -m "fails internal build" -c nosignal

@pytorchbot successfully started a revert job. Check the current status here.

@soulitzer your PR has been successfully reverted.
This reverts commit 049838f. Reverted #85849 on behalf of https://github.com/albanD due to fails internal build
This reverts commit e525f43. [ghstack-poisoned]
This reverts commit e525f43.

Original PR: #85849
Fixes #ISSUE_NUMBER

In addition to reverting the revert, this PR:
- defines the virtual destructor of FunctionPreHook in the header. Why? Presumably the internal build imports the header from somewhere, but does not have function_hooks.cpp (where the virtual destructor was previously defined) in the same compilation unit.

Pull Request resolved: #92559
Approved by: https://github.com/albanD
Stack from ghstack (oldest at bottom):
Addresses: #35802
Design doc: https://docs.google.com/document/d/19xSib7FFknRQ5f3ptGFUmiOt3BrgXSUlTQH2xMcZJYg/edit#
Changes in this PR

Implementation
- We now have 3 fields: pre_hooks, retains_grad_hooks, and tensor_pre_hooks, so that we can more precisely define their ordering and when they are executed.
- Since retains grad uses an entirely new field, we cannot reuse the old retains grad logic. We refactor retains grad to call directly into the variable.cpp logic. Other logic in variable.cpp that handles cpp hooks must also be updated.

Hooks ordering and execution:
- Defines pre-hooks registered on a tensor to run before pre-hooks registered on grad_fn.
- Updates pre-hooks registered on a tensor to always run, even if the tensor is passed as inputs= to .grad().
- Post hooks (and pre hooks) can now observe the modifications to the gradient made by the tensor pre-hook.

Retains grad hooks
- retains_grad hooks always execute last, even if there are other tensor pre-hooks registered.

Unchanged:
- pre_hooks registered to grad_fn aren't expected to execute if the corresponding tensor is passed as inputs= to .grad().

Follow ups:
- simplify the retains_grad field to not be a vector, since it always holds a single hook
- potentially merge capture hooks with tensor pre-hooks; this would involve some additional refactoring
- the behavior of Python hooks registered to a tensor on in-place ops is still wrong
cc @ezyang @gchanan
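A hedged end-to-end sketch of the "retains grad hooks always execute last" rule (illustrative only; not a test from this PR, and behavior depends on the PyTorch version it lands in):

```python
import torch

x = torch.ones(3, requires_grad=True)
y = 5 * x

y.retain_grad()                    # registered first...
y.register_hook(lambda g: g * 2)   # ...but the tensor pre-hook still runs first

y.sum().backward()
print(y.grad)  # expected: tensor([2., 2., 2.]) -- retain_grad sees the doubled grad
print(x.grad)  # expected: tensor([10., 10., 10.]) -- downstream also sees it
```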