[FSDP][1/N] Refactor module materialization #94196
Conversation
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/94196
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 221ef73.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: d064a75 Pull Request resolved: pytorch#94196
rohan-varma left a comment:
LGTM thanks!
```python
if is_meta_module or is_torchdistX_deferred_init:
    materialized_module = True
    # Save the parameter and buffer names to reacquire references after
    # materialization since their variables may change
```
Even after reading the PR description, I'm not 100% sure why the variables may change after materialization? I thought materialization is all about filling in meta parameters with their actual values?
Yes, but at the implementation level, `module.to(device)` will replace all meta-device parameters with new Python parameter variables.
The following function returns `False` for meta-device tensors:
pytorch/torch/nn/modules/module.py, lines 799 to 800 (at commit a064ce1):

```python
def compute_should_use_set_data(tensor, tensor_applied):
    if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
```
This leads to the `else` branch:

pytorch/torch/nn/modules/module.py, lines 821 to 829 (at commit a064ce1):
```python
should_use_set_data = compute_should_use_set_data(param, param_applied)
if should_use_set_data:
    param.data = param_applied
    out_param = param
else:
    assert isinstance(param, Parameter)
    assert param.is_leaf
    out_param = Parameter(param_applied, param.requires_grad)
    self._parameters[key] = out_param
```
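To make this concrete, here is a minimal sketch (not from the PR) showing that the old `Parameter` reference goes stale after materialization. It assumes `to_empty()` as the materialization step, since `.to()` cannot copy out of a meta tensor:

```python
import torch.nn as nn

lin = nn.Linear(4, 4, device="meta")
weight_before = lin.weight            # reference to the meta-device Parameter

lin.to_empty(device="cpu")            # materialize: allocate real (uninitialized) storage

print(weight_before is lin.weight)    # False: a new Parameter variable was created
print(weight_before.is_meta)          # True: the old reference still points at the meta tensor
```

This is why the parameter and buffer names are recorded and the references are reacquired after materialization.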
```python
elif is_meta_module:
    _materialize_meta_module(fully_sharded_module, device_id)
elif is_torchdistX_deferred_init:
    deferred_init.materialize_module(
```
do we have unittests covering deferred init for the composable path?
We do not because torchdistX does not support the latest PyTorch version, so I could not run torchdistX locally.
**Overview**

This refactors module materialization (i.e. meta device or `torchdistX` deferred initialization) to compute the parameter and buffer names as needed instead of pre-computing them. These are needed to reacquire references to the states (e.g. `module.get_parameter(param_name)`) after materialization since the materialization may create new variables.

This refactor simplifies `_get_fully_sharded_module_to_states()` (the core function for "pseudo auto wrapping") to better enable lowest common ancestor (LCA) module computation for shared parameters, for which tracking parameter and buffer names may complicate the already non-obvious implementation.

**Discussion**

The tradeoff is a worst case quadratic traversal over modules if materializing all of them. However, since (1) the number of modules is relatively small, (2) the computation per module in the quadratic traversal is negligible, (3) this runs only once per training session, and (4) module materialization targets truly large models, I think this tradeoff is tolerable.

**For Reviewers**

- `_init_param_handle_from_module()` initializes _one_ `FlatParamHandle` from a fully sharded module and represents the module wrapper code path. For this code path, there is no need to reacquire references to the parameters/buffers for now since the managed parameters are only computed after materialization. This works because the managed parameters have a simple definition: any parameter in the local root module's tree excluding those already marked as flattened by FSDP. Similarly, FSDP marks buffers to indicate that they have already been processed (synced if `sync_module_states`).
- `_init_param_handles_from_module()` initializes _all_ `FlatParamHandle`s from a fully sharded module and represents the composable code path. For this code path, we must reacquire references to parameters/buffers because each logical wrapping is specified as a list of parameters/buffers to group together by those variables and because materialization may create new variables.

[ghstack-poisoned]
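To illustrate the "compute names as needed, then reacquire by name" pattern described above, here is a minimal sketch. `materialize_and_reacquire` and the use of `to_empty()` are illustrative stand-ins, not the actual FSDP helpers:

```python
import torch
import torch.nn as nn

def materialize_and_reacquire(module: nn.Module, device: torch.device):
    # Record fully qualified names before materialization; the Parameter and
    # buffer objects may be replaced, but their names stay stable.
    param_names = [name for name, _ in module.named_parameters()]
    buffer_names = [name for name, _ in module.named_buffers()]

    # Stand-in materialization step for a meta-device module (the actual FSDP
    # logic also handles torchdistX deferred init and sync_module_states).
    module.to_empty(device=device)

    # Reacquire references by name, since the pre-materialization references
    # may now point at stale meta-device tensors.
    params = [module.get_parameter(name) for name in param_names]
    buffers = [module.get_buffer(name) for name in buffer_names]
    return params, buffers

# Example usage with a meta-device module:
params, buffers = materialize_and_reacquire(nn.Linear(8, 8, device="meta"), torch.device("cpu"))
```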
ghstack-source-id: 3563d70 Pull Request resolved: pytorch#94196
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack:

- #94198 [FSDP][3/N] Add LCA logic to `fully_shard`