
Conversation

@awgu (Collaborator) commented Nov 28, 2022

Stack from ghstack (oldest at bottom):

This assert was accidentally made stricter when transitioning from per-FSDP-instance training state to per-handle training state. This PR relaxes it again, which should restore compatibility for some reentrant AC plus FSDP cases.
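
For readers skimming the change, here is a minimal, self-contained sketch of the kind of relaxation being described. The enums and function below are toy stand-ins, not FSDP's actual internals (the real state types live in private torch.distributed.fsdp modules and differ across versions):

```python
from enum import Enum, auto

# Toy stand-ins for FSDP's internal training-state enums (illustrative only).
class TrainingState(Enum):
    IDLE = auto()
    FORWARD_BACKWARD = auto()

class HandleTrainingState(Enum):
    BACKWARD_PRE = auto()
    BACKWARD_POST = auto()

def post_backward_state_check(fsdp_state, handle_state):
    # The FSDP instance itself must be in its forward/backward phase.
    assert fsdp_state == TrainingState.FORWARD_BACKWARD
    # Per-handle check: a strict version would accept only BACKWARD_PRE, so a
    # second hook invocation under reentrant AC (handle already in
    # BACKWARD_POST) trips the assert. The relaxed version accepts both.
    assert handle_state in (
        HandleTrainingState.BACKWARD_PRE,
        HandleTrainingState.BACKWARD_POST,
    ), f"Unexpected handle state: {handle_state}"
    # The hook then marks the handle as having completed post-backward.
    return HandleTrainingState.BACKWARD_POST
```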

@pytorch-bot bot commented Nov 28, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/89791

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a510d8d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the release notes: distributed (fsdp) label Nov 28, 2022
"FullyShardedDataParallel._post_backward_hook"
):
_assert_in_training_states(state, [TrainingState.FORWARD_BACKWARD])
# For reentrant AC, the post-backward hook may run multiple times in
nit: For reentrant AC multiple times

@awgu added the topic: improvements label Nov 28, 2022
@awgu (Collaborator, Author) commented Nov 28, 2022

@pytorchbot rebase -s

@pytorchmergebot (Collaborator) commented:

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot (Collaborator) commented:

Successfully rebased gh/awgu/218/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/89791)

pytorchmergebot pushed a commit that referenced this pull request Nov 28, 2022
ghstack-source-id: 268645d
Pull Request resolved: #89791
@awgu added the ciflow/trunk label Nov 28, 2022
@awgu (Collaborator, Author) commented Nov 29, 2022

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

mrshenli added a commit that referenced this pull request Nov 29, 2022
When combining FSDP with reentrant checkpointing, the post-backward
hook might run twice and then hit [this
error](https://github.com/pytorch/pytorch/blob/e20ec44544c17d6d3d411f88b870e05043bda731/torch/distributed/fsdp/_runtime_utils.py#L487).
This is because reentrant backward uses nested autograd GraphTasks:
the inner GraphTask is not aware of the outer one and therefore
flushes pending `AccumulateGrad` invocations on exit, which in turn
triggers the post-backward hooks registered by FSDP. Later, the outer
GraphTask triggers them again, leading to the above error.

PR #89791 relaxes the FSDP training state check, but we still
occasionally run into grad value check failures. Therefore, this PR
only lands the non-reentrant test; the reentrant test can be enabled
once the accuracy issues are addressed.

[ghstack-poisoned]
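
To make the scenario concrete, below is a hedged sketch of combining FSDP with an activation-checkpointed submodule. It assumes a torchrun launch with one CUDA device per rank; the module structure and names are illustrative and are not the test added in #89781:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.checkpoint import checkpoint


class CheckpointedBlock(nn.Module):
    """Two linear layers whose forward is recomputed during backward."""

    def __init__(self, dim: int, use_reentrant: bool):
        super().__init__()
        self.lin1 = nn.Linear(dim, dim)
        self.lin2 = nn.Linear(dim, dim)
        self.use_reentrant = use_reentrant

    def forward(self, x):
        # use_reentrant=True runs the recomputation inside a nested autograd
        # GraphTask, which is the interaction described in the commit message.
        return checkpoint(
            lambda t: self.lin2(torch.relu(self.lin1(t))),
            x,
            use_reentrant=self.use_reentrant,
        )


def main():
    dist.init_process_group("nccl")  # assumes torchrun sets the env vars
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    model = nn.Sequential(
        CheckpointedBlock(16, use_reentrant=False),  # True exercises the reentrant path
        nn.Linear(16, 16),
    ).cuda()
    fsdp_model = FSDP(model)
    loss = fsdp_model(torch.randn(4, 16, device="cuda")).sum()
    loss.backward()  # FSDP's post-backward hooks fire during this call
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```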
mrshenli added a commit that referenced this pull request Nov 29, 2022
ghstack-source-id: 8848c4c
Pull Request resolved: #89781
@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 1 additional job has failed: trunk

Details for Dev Infra team (raised by workflow job)

pytorchmergebot pushed a commit that referenced this pull request Nov 29, 2022
Pull Request resolved: #89781
Approved by: https://github.com/rohan-varma
@awgu (Collaborator, Author) commented Nov 29, 2022

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
Pull Request resolved: pytorch#89781
Approved by: https://github.com/rohan-varma
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
Pull Request resolved: pytorch#89791
Approved by: https://github.com/zhaojuanmao
@facebook-github-bot deleted the gh/awgu/218/head branch June 8, 2023 15:28