[FSDP][optim_state_dict][8/N] Enable fully_shard optim state_dict save and load #91234
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91234
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit de7f826. This comment was automatically generated by Dr. CI and updates every 15 minutes.
rohan-varma
left a comment
LGTM! Super exciting stuff :D
    FSDPInitMode,
    FSDPTest,
    TransformerWithSharedParams,
)
nit: can we have the formatting changes in a separate PR?
I recognize this is tricky, and I think it's time to align on a formatting convention for the FSDP codebase and automate it. cc @awgu
My plan was to just get everyone on lintrunner and lintrunner f at the beginning of next half. I decided that since we are cranking out PRs with urgency right now, we can just not worry about it. The PR to achieve this looks like #90873. I have re-pushed recently, but the main change is just in the .lintrunner.toml file and making sure all relevant files are compliant.
I do think that unifying under lintrunner / lintrunner f is nice. Sometimes I add changes to a file that create long lines or add imports, and I want to just auto-format. However, without an agreed-upon auto-formatter, this becomes a problem and actually complicates the workflow.
Will rebase this PR on top of #91255.
        return 2

    @skip_if_lt_x_gpu(2)
    def _test_optim_state_dict_save_load(self):
Might be better to just have the test enabled instead of disabling it with the underscore prefix, and to add a skip decorator mentioning the reason it is disabled and filing a tracking issue. A sketch of what that could look like is below.
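A minimal sketch of that suggestion, assuming plain `unittest.skip` for the reason string; in the actual FSDP test suite this would sit on an `FSDPTest` subclass alongside the existing `@skip_if_lt_x_gpu(2)` decorator, and the issue reference is a placeholder:

```python
import unittest


class TestFullyShardOptimStateDict(unittest.TestCase):
    @unittest.skip(
        "Disabled until the all_gather_object issue preventing CI runs is "
        "fixed; see the tracking issue"
    )
    def test_optim_state_dict_save_load(self) -> None:
        # Test body would exercise optim state_dict save/load for fully_shard.
        ...


if __name__ == "__main__":
    unittest.main()
```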
):
    _insert_module_state(submodule, state)
# Insert all comm_modules to the module to state mapping.
for submodule in state._fully_sharded_module_to_handles.keys():
Is this change equivalent to the former code? If not, is there a reason we're changing the inserted states?
This is not equivalent to the former code. The reason behind the change is to only map the modules that actually have the handles -- the local root modules.
mapping between parameters and parameter IDs. Using ``optim_input`` is being
deprecated.
If the optimizer is a ``NamedOptimizer``, the optimizer state_dict does not
What if `optim_input` is provided but the optimizer is also a `NamedOptimizer`? Will that create an issue?
Yes, it will fail. Will add error handling for this; a sketch of the kind of guard is below.
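A minimal sketch of the kind of guard described here; the function name and the string-based type check are illustrative, not the actual code added to `_optim_utils.py`:

```python
from typing import Any, Optional


def _check_optim_input_conflict(optim: Any, optim_input: Optional[Any]) -> None:
    # NamedOptimizer keys its state_dict by parameter FQN, while the deprecated
    # ``optim_input`` implies the legacy parameter-ID mapping, so accepting
    # both at once would be ambiguous.
    if optim_input is not None and type(optim).__name__ == "NamedOptimizer":
        raise RuntimeError(
            "``optim_input`` is deprecated and cannot be used together with "
            "a ``NamedOptimizer``."
        )
```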
    composable_optim_state_dict["param_groups"],
):
    for key, value in group1.items():
        self.assertEqual(value, group2[key])
Is it worth adding tests for:
- non-root FSDP
- DDP / replicate root
- nested FSDP + non-root?
Added an extra test for non-root FSDP. Will add more tests after fixing the all_gather_object issue that prevents us from running tests on CI.
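For reference, a self-contained sketch of the `param_groups` comparison shown in the hunk above, using plain `unittest` and illustrative stand-in dictionaries rather than real FSDP/composable optimizer state_dicts:

```python
import unittest


class ParamGroupComparisonTest(unittest.TestCase):
    def test_param_groups_match(self) -> None:
        # Stand-in state_dicts; in the real test these come from the wrapper
        # FSDP model and the fully_shard (composable) model respectively.
        fsdp_optim_state_dict = {"param_groups": [{"lr": 1e-3, "params": [0, 1]}]}
        composable_optim_state_dict = {"param_groups": [{"lr": 1e-3, "params": [0, 1]}]}
        for group1, group2 in zip(
            fsdp_optim_state_dict["param_groups"],
            composable_optim_state_dict["param_groups"],
        ):
            for key, value in group1.items():
                self.assertEqual(value, group2[key])


if __name__ == "__main__":
    unittest.main()
```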
… folders (#91255) This PR applies ufmt to format `_composable`-related code. This is a request from #91234 to separate formatting changes into a new PR. Pull Request resolved: #91255. Approved by: https://github.com/awgu
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 2 additional jobs have failed; the first few of them are: trunk, trunk / linux-focal-rocm5.3-py3.8 / test (default, 2, 2, linux.rocm.gpu). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
What does this PR do?
This PR refactors `_optim_utils.py` to use `_FSDPState` instead of the `FullyShardedDataParallel` class. This change enables support of optim state_dict for `fully_shard`.
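For context, a minimal sketch of how the resulting feature is meant to be used; this is not code from the PR. It assumes a distributed process group is already initialized (e.g. via torchrun), that `fully_shard` is importable from `torch.distributed._composable`, and that the static `FSDP.optim_state_dict` / `FSDP.optim_state_dict_to_load` helpers are available (the exact argument order of the latter has varied across PyTorch releases):

```python
import torch
import torch.nn as nn
from torch.distributed._composable import fully_shard
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def save_and_reload_optim_state() -> None:
    model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
    fully_shard(model)  # composable FSDP: shards parameters in place, no wrapper class
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Run one step so the optimizer has state to save.
    model(torch.randn(2, 8)).sum().backward()
    optim.step()

    # Gather the sharded optimizer state into a consolidated state_dict
    # keyed by fully qualified parameter names.
    osd = FSDP.optim_state_dict(model, optim)

    # Convert the saved state back into a loadable form for a fresh optimizer.
    new_optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    loadable_osd = FSDP.optim_state_dict_to_load(model, new_optim, osd)
    new_optim.load_state_dict(loadable_osd)
```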