[FSDP] Add re-key btw param names/IDs for optim state dict #74912
Conversation
[ghstack-poisoned]
💊 CI failures summary and remediations: As of commit d9b02e1 (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚 (This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the internal Dr. CI Users group.)
- def _get_flat_param_id_to_param(
+ def _get_param_id_to_param(
Renaming this because it actually maps parameter ID to parameter in both the flattened and unflattened cases, as long as the keys and values are consistent.
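As a rough illustration of the comment above, a minimal sketch of what such a renamed helper could look like (the signature and the positional-ID assumption are illustrative, not the actual FSDP implementation):

```python
import torch.nn as nn

def _get_param_id_to_param(model: nn.Module) -> list:
    # Hypothetical sketch: assumes the optimizer was built over
    # model.parameters() in registration order, so parameter ID i is
    # simply the i-th parameter. The same mapping works whether the
    # parameters are flattened FSDP parameters or ordinary unflattened
    # ones, as long as keys and values stay consistent.
    return list(model.parameters())

model = nn.Linear(4, 2)
id_to_param = _get_param_id_to_param(model)
```

Indexing `id_to_param` with an optimizer's integer parameter ID then recovers the corresponding parameter tensor.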
    }
    return sharded_optim_state_dict

    @staticmethod
This functionality is actually not unique to FSDP. It could be used generally for PyTorch. However, nowhere else have we seen keying by parameter name, so I have put this inside fully_sharded_data_parallel.py for now.
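To illustrate why this is not FSDP-specific, here is a minimal sketch of rekeying an optimizer state dict from parameter IDs to parameter names using only `nn.Module` naming; the helper name and exact dict handling are assumptions for illustration, not PyTorch's implementation:

```python
import torch
import torch.nn as nn

def rekey_ids_to_names(osd: dict, model: nn.Module) -> dict:
    # Hypothetical sketch: assumes the optimizer covers
    # model.parameters() in registration order, so positional ID i
    # corresponds to the i-th named parameter. Nothing here depends
    # on FSDP; only nn.Module's naming is used.
    id_to_name = {i: name for i, (name, _) in enumerate(model.named_parameters())}
    return {
        "state": {id_to_name[pid]: s for pid, s in osd["state"].items()},
        "param_groups": [
            {**g, "params": [id_to_name[pid] for pid in g["params"]]}
            for g in osd["param_groups"]
        ],
    }

model = nn.Linear(3, 1)
optim = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss = model(torch.randn(2, 3)).sum()
loss.backward()
optim.step()  # populates per-parameter state (momentum buffers)
named_osd = rekey_ids_to_names(optim.state_dict(), model)
```

After rekeying, the state is keyed by `"weight"` and `"bias"` rather than by the positional IDs `0` and `1`.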
sounds good!
        state).
        """
    non_none_tensors = [t for t in pos_dim_tensors if t is not None]
    # Check that all are tensors on CPU with the same dtype
Just removing this check, which is overly strict. We can just move the tensors to CPU inside the function to avoid device mismatch. It is not actually semantically important which device the tensors in the optimizer state are on.
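A minimal sketch of the suggested alternative, moving tensors to CPU inside the function instead of asserting their device (the function name is an assumption; this is not the actual FSDP code):

```python
import torch

def _flatten_optim_state_tensors(pos_dim_tensors):
    # Hypothetical sketch: rather than checking that every tensor is
    # already on CPU with a matching dtype, normalize to CPU here. The
    # device an optimizer state tensor lives on is not semantically
    # important, so this avoids spurious device-mismatch failures.
    non_none_tensors = [t for t in pos_dim_tensors if t is not None]
    cpu_tensors = [t.detach().cpu() for t in non_none_tensors]
    if not cpu_tensors:
        return torch.tensor([])
    return torch.cat([t.reshape(-1) for t in cpu_tensors])

flat = _flatten_optim_state_tensors([torch.ones(2, 2), None, torch.zeros(3)])
```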
zhaojuanmao left a comment:
looks great!
    SHARDED_STATE_DICT = auto()


class OptimStateKeyType(Enum):
nit: let's export it in the fsdp `__init__.py` file as well?
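For context, a self-contained sketch of the enum under discussion; the member names follow the PR description, but the actual definition in `fully_sharded_data_parallel.py` may differ, and the re-export suggested above would be a one-line import in the package `__init__.py`:

```python
from enum import Enum, auto

class OptimStateKeyType(Enum):
    # Sketch only: member names taken from the PR description's usage
    # (OptimStateKeyType.PARAM_NAME / OptimStateKeyType.PARAM_ID).
    PARAM_NAME = auto()
    PARAM_ID = auto()

# The nit suggests making this importable from the package root, e.g.
#   from torch.distributed.fsdp import OptimStateKeyType
key_type = OptimStateKeyType.PARAM_NAME
```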
@awgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
**Overview**

This introduces a new static method `FSDP.rekey_optim_state_dict()` as a utility for interoperating between local/DDP (non-wrapped) models and FSDP (wrapped) models.

To load from a wrapped model to a non-wrapped model:

```python
wrapped_model, wrapped_optim = ...
full_osd = FSDP.full_optim_state_dict(wrapped_model, wrapped_optim)
nonwrapped_model, nonwrapped_optim = ...
rekeyed_osd = FSDP.rekey_optim_state_dict(full_osd, OptimStateKeyType.PARAM_ID, nonwrapped_model)
nonwrapped_optim.load_state_dict(rekeyed_osd)
```

To load from a non-wrapped model to a wrapped model:

```python
nonwrapped_model, nonwrapped_optim = ...
osd = nonwrapped_optim.state_dict()
rekeyed_osd = FSDP.rekey_optim_state_dict(osd, OptimStateKeyType.PARAM_NAME, nonwrapped_model)
wrapped_model, wrapped_optim = ...
sharded_osd = FSDP.shard_full_optim_state_dict(rekeyed_osd, wrapped_model)
wrapped_optim.load_state_dict(sharded_osd)
```

**Test Plan**

`test_rekey_optim_state_dict_to_ids()` and `test_rekey_optim_state_dict_to_names()`.

Differential Revision: [D35225819](https://our.internmc.facebook.com/intern/diff/D35225819)

[ghstack-poisoned]
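To make the name-to-ID rekeying direction concrete, here is a minimal sketch under the assumption that the target optimizer is built over `model.parameters()` in registration order; the helper and the toy state dict are illustrative, not the actual FSDP implementation:

```python
import torch.nn as nn

def rekey_names_to_ids(osd: dict, model: nn.Module) -> dict:
    # Hypothetical sketch of the OptimStateKeyType.PARAM_ID direction:
    # map parameter names back to positional parameter IDs so a plain
    # optimizer built over model.parameters() can load the state.
    name_to_id = {name: i for i, (name, _) in enumerate(model.named_parameters())}
    return {
        "state": {name_to_id[n]: s for n, s in osd["state"].items()},
        "param_groups": [
            {**g, "params": sorted(name_to_id[n] for n in g["params"])}
            for g in osd["param_groups"]
        ],
    }

model = nn.Linear(3, 1)
# Toy name-keyed optimizer state dict for illustration.
name_keyed = {
    "state": {"weight": {"step": 1}, "bias": {"step": 1}},
    "param_groups": [{"lr": 0.1, "params": ["weight", "bias"]}],
}
id_keyed = rekey_names_to_ids(name_keyed, model)
```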
Summary: Pull Request resolved: #74912. Test Plan: Imported from OSS. Reviewed By: ngimel. Differential Revision: D35225819. Pulled By: awgu. fbshipit-source-id: fbbdbde8b595a9c65b17a9aecb4f22b2c9761a23