Conversation

@mrshenli (Contributor) commented Nov 28, 2022

Stack from ghstack (oldest at bottom):

When combining FSDP with reentrant activation checkpointing, the post-backward
hook might run twice and then hit [this
error](https://github.com/pytorch/pytorch/blob/e20ec44544c17d6d3d411f88b870e05043bda731/torch/distributed/fsdp/_runtime_utils.py#L487).
This is because reentrant backward uses nested autograd GraphTasks.
The inner GraphTask is not aware of the outer one and therefore
flushes pending `AccumulateGrad` invocations on exit, which in
turn triggers the post-backward hooks registered by FSDP. Later,
the outer GraphTask triggers them again, leading to the above
error.
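
For context, here is a minimal sketch of the pattern involved (a simplification, not the actual FSDP code in `_runtime_utils.py`; the toy module and hook counter are made up): FSDP registers its post-backward hook on each parameter's `AccumulateGrad` node, so any extra flush of those nodes from a nested GraphTask fires the hook again.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

lin = nn.Linear(4, 4)

# Reach the parameter's AccumulateGrad node through a view of the
# parameter; this mirrors, in simplified form, how FSDP attaches its
# post-backward hook.
acc_grad = lin.weight.expand_as(lin.weight).grad_fn.next_functions[0][0]
calls = []
acc_grad.register_hook(lambda *unused: calls.append(1))

x = torch.randn(2, 4, requires_grad=True)
out = checkpoint(lin, x, use_reentrant=True)  # reentrant: nested GraphTask
out.sum().backward()

# With FSDP in the mix, the PR observed hooks like this firing twice:
# once when the inner GraphTask flushes pending AccumulateGrad nodes,
# and again from the outer GraphTask.
print("hook invocations:", len(calls))
```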

PR #89791 relaxes the FSDP training state check, but we still run
into grad value check failures occasionally. Therefore, this PR only
lands the non-reentrant test; we can enable the reentrant test once
the accuracy issues are addressed.
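
As a rough illustration of the kind of grad value check the landed test performs (FSDP itself needs an initialized process group and is omitted here; `param_grads` and the toy model are made up for the sketch), gradients with non-reentrant checkpointing should match a non-checkpointed run:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

def param_grads(use_checkpoint: bool):
    # Same seed for both runs so weights and inputs are identical.
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
    x = torch.randn(4, 8, requires_grad=True)
    if use_checkpoint:
        # Non-reentrant checkpointing: no nested GraphTask is created.
        out = checkpoint(model, x, use_reentrant=False)
    else:
        out = model(x)
    out.sum().backward()
    return [p.grad.clone() for p in model.parameters()]

# Gradients with and without activation checkpointing should match.
for g_ckpt, g_ref in zip(param_grads(True), param_grads(False)):
    torch.testing.assert_close(g_ckpt, g_ref)
```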

@pytorch-bot bot commented Nov 28, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/89781

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 73c7752:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the topic: not user facing label Nov 28, 2022
@awgu (Collaborator) left a comment

Thanks for adding these tests!

At a meta level, I am wondering how we should approach testing more interleavings in a systematic and complete way.

Also, I am not sure if you want to wait for the assert relaxation PR to land and then update test_checkpoint_submodule_reentrant() or not.

@rohan-varma (Contributor) left a comment

LGTM, stamping to unblock. Will file a follow-up issue to debug why this FSDP + AC structure does not work.

mrshenli added a commit that referenced this pull request Nov 29, 2022
ghstack-source-id: 8848c4c
Pull Request resolved: #89781
@mrshenli (Contributor, Author) commented Nov 29, 2022

> Also, I am not sure if you want to wait for the assert relaxation PR to land and then update test_checkpoint_submodule_reentrant() or not.

Updated the PR summary to include that. Due to the grad value issue, the new tests are not testing that code path at the moment.

@mrshenli added the ciflow/trunk label Nov 29, 2022
@mrshenli (Contributor, Author) commented
@pytorchbot merge -g

@pytorchmergebot (Collaborator) commented
Merge started

Your change will be merged once all checks on your PR pass since you used the green (-g) flag (ETA: 0-4 Hours).


kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
Pull Request resolved: pytorch#89781
Approved by: https://github.com/rohan-varma
@facebook-github-bot deleted the gh/mrshenli/342/head branch June 8, 2023 18:02

Labels: ciflow/trunk, Merged, topic: not user facing
