Conversation

@mrshenli (Contributor) commented Nov 28, 2022

Stack from ghstack (oldest at bottom):

When combining FSDP with reentrant activation checkpointing, the post-backward
hook might run twice and then hit [this
error](https://github.com/pytorch/pytorch/blob/e20ec44544c17d6d3d411f88b870e05043bda731/torch/distributed/fsdp/_runtime_utils.py#L487).
This is because reentrant backward uses nested autograd GraphTasks.
The inner GraphTask is not aware of the outer one and therefore
flushes pending `AccumulateGrad` invocations on exit, which in
turn triggers the post-backward hooks registered by FSDP. Later,
the outer GraphTask triggers them again, leading to the above
error.
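
For context, here is a minimal sketch of the pattern involved (a simplification, not the actual FSDP code in `_runtime_utils.py`; the toy module and hook counter are made up): FSDP registers its post-backward hook on each parameter's `AccumulateGrad` node, so any extra flush of those nodes from a nested GraphTask fires the hook again.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

lin = nn.Linear(4, 4)

# Reach the parameter's AccumulateGrad node through a view of the
# parameter; this mirrors, in simplified form, how FSDP attaches its
# post-backward hook.
acc_grad = lin.weight.expand_as(lin.weight).grad_fn.next_functions[0][0]
calls = []
acc_grad.register_hook(lambda *unused: calls.append(1))

x = torch.randn(2, 4, requires_grad=True)
out = checkpoint(lin, x, use_reentrant=True)  # reentrant: nested GraphTask
out.sum().backward()

# With FSDP in the mix, the PR observed hooks like this firing twice:
# once when the inner GraphTask flushes pending AccumulateGrad nodes,
# and again from the outer GraphTask.
print("hook invocations:", len(calls))
```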

PR #89791 relaxes the FSDP training state check, but we still run
into grad value check failures occasionally. Therefore, this PR only
lands the non-reentrant test; we can enable the reentrant test once
the accuracy issues are addressed.
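
As a rough illustration of the kind of grad value check the landed test performs (FSDP itself needs an initialized process group and is omitted here; `param_grads` and the toy model are made up for the sketch), gradients with non-reentrant checkpointing should match a non-checkpointed run:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

def param_grads(use_checkpoint: bool):
    # Same seed for both runs so weights and inputs are identical.
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
    x = torch.randn(4, 8, requires_grad=True)
    if use_checkpoint:
        # Non-reentrant checkpointing: no nested GraphTask is created.
        out = checkpoint(model, x, use_reentrant=False)
    else:
        out = model(x)
    out.sum().backward()
    return [p.grad.clone() for p in model.parameters()]

# Gradients with and without activation checkpointing should match.
for g_ckpt, g_ref in zip(param_grads(True), param_grads(False)):
    torch.testing.assert_close(g_ckpt, g_ref)
```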

@pytorch-bot bot commented Nov 28, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/89781

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 73c7752:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the topic: not user facing label Nov 28, 2022
@awgu (Collaborator) left a comment

Thanks for adding these tests!

At a meta level, I am wondering how we should approach testing more interleavings in a systematic and complete way.

Also, I am not sure if you want to wait for the assert relaxation PR to land and then update test_checkpoint_submodule_reentrant() or not.

@rohan-varma (Contributor) left a comment

LGTM, stamping to unblock. Will file a follow-up issue to debug why this FSDP + AC structure does not work.

mrshenli added a commit that referenced this pull request Nov 29, 2022
ghstack-source-id: 8848c4c
Pull Request resolved: #89781
@mrshenli (Contributor, Author) commented Nov 29, 2022

> Also, I am not sure if you want to wait for the assert relaxation PR to land and then update test_checkpoint_submodule_reentrant() or not.

Updated the PR summary to include that. Due to the grad value issue, the new tests are not testing that code path at the moment.

@mrshenli added the ciflow/trunk label Nov 29, 2022
@mrshenli (Contributor, Author) commented
@pytorchbot merge -g

@pytorchmergebot (Collaborator) commented
Merge started

Your change will be merged once all checks on your PR pass since you used the green (-g) flag (ETA: 0-4 Hours).


kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
Pull Request resolved: pytorch#89781
Approved by: https://github.com/rohan-varma
@facebook-github-bot deleted the gh/mrshenli/342/head branch June 8, 2023 18:02

Labels: ciflow/trunk, Merged, topic: not user facing
