
Conversation

@awgu (Collaborator) commented Jan 5, 2023

Stack from ghstack:

Cleaning FQNs for `FullyShardedDataParallel(use_orig_params=True)` can cause discrepancies between the FQNs that FSDP reports and those obtained by manually looping over `named_modules()` and `named_parameters()` together.

There is no requirement for the FQNs to be clean when using wrapper FSDP + use_orig_params=True. We can leave clean FQNs to fully_shard.
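For illustration, here is a minimal sketch (not from this PR; the model and names are hypothetical) of the discrepancy: FQNs composed from `named_modules()` and `named_parameters(recurse=False)` follow the true module structure, including wrapper FSDP's `_fsdp_wrapped_module.` prefix, whereas cleaned FQNs strip that prefix.

```python
# Hypothetical sketch: compare FQNs built from the module tree with cleaned FQNs.
# `clean_tensor_name` is the private FSDP helper referenced later in this PR.
import torch.nn as nn
from torch.distributed.fsdp._common_utils import clean_tensor_name

def structural_fqns(model: nn.Module):
    # FQNs composed from named_modules() + named_parameters(recurse=False)
    for module_name, module in model.named_modules():
        for param_name, _ in module.named_parameters(recurse=False):
            yield f"{module_name}.{param_name}" if module_name else param_name

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
# If `model` were wrapped as FullyShardedDataParallel(model, use_orig_params=True),
# the structural FQN would be e.g. "_fsdp_wrapped_module.0.weight", while the cleaned
# FQN would be "0.weight" -- the mismatch this PR avoids by no longer cleaning.
for fqn in structural_fqns(model):
    print(fqn, "->", clean_tensor_name(fqn))
```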

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire

@pytorch-bot bot commented Jan 5, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91767

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d50af79:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (sharded) label Jan 5, 2023
@awgu awgu added the release notes: distributed (fsdp) and topic: improvements labels and removed the release notes: distributed (sharded) label Jan 5, 2023
awgu pushed a commit that referenced this pull request Jan 5, 2023
ghstack-source-id: 66d46f3
Pull Request resolved: #91767
@awgu awgu changed the title [WIP] Do not clean FQNs even for use_orig_params=True [FSDP] Do not clean FQNs even for use_orig_params=True Jan 8, 2023
In `torch/_dynamo/testing.py`:

```diff
     assert name.startswith(prefix)
     name = name[len(prefix) :]
-    return torch.distributed.fsdp._common_utils.clean_tensor_name(name)
+    return name
```
@awgu (Collaborator, Author) commented Jan 12, 2023

@wconstab

Context: This PR changes FSDP so that when use_orig_params=True, the parameter names returned from named_parameters() include the _fsdp_wrapped_module. prefixes. Before, FSDP overrode named_parameters() to clean the prefix (see the change in fully_sharded_data_parallel.py). I found that overriding it to clean the prefix creates an unwanted discrepancy with the true module structure, which can lead to some issues. We can leave having clean parameter names to the currently-developed composable APIs (e.g. fully_shard).

This change here in _dynamo/testing.py to not call clean_tensor_name() allows the unit tests to pass. I still wanted to check: does Dynamo + FSDP rely on FSDP's named_parameters() returning the cleaned names (i.e. without _fsdp_wrapped_module.)?
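For concreteness, a hypothetical example of the names involved (not taken from the test suite): Dynamo's `OptimizedModule` prepends `_orig_mod.` and wrapper FSDP prepends `_fsdp_wrapped_module.`; before this PR, the helper also stripped the FSDP prefix via `clean_tensor_name()`.

```python
# Hypothetical FQN for a parameter under Dynamo + wrapper FSDP (use_orig_params=True):
name = "_orig_mod._fsdp_wrapped_module.layer.weight"

# Sketch of remove_optimized_module_prefix's logic: strip the Dynamo prefix.
prefix = "_orig_mod."
assert name.startswith(prefix)
name = name[len(prefix):]  # "_fsdp_wrapped_module.layer.weight"

# Before this PR, the helper then called clean_tensor_name(), yielding "layer.weight".
# After this PR, the prefixed name is kept, matching what eager wrapper FSDP now reports.
```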

Contributor:

I don't think Dynamo relies on the structure of the name either way, other than relying on finding the `is_fsdp_wrapped_module` boolean to enable FSDP handling.

Question: what is the history of the `remove_optimized_module_prefix` function? At a glance, I have no memory of modifying it myself for FSDP, but I see it had an FSDP clean-param call in it. Did you add that before, and are you taking it out now?

@awgu (Collaborator, Author):

`remove_optimized_module_prefix` came from #89113.

My understanding is as follows:

```python
for name, param in model.named_parameters():
    if isinstance(model, eval_frame.OptimizedModule):
        name = remove_optimized_module_prefix(name)
```

```python
correct_results = collect_results(eager_model, correct_outputs.logits, correct_loss, inputs_flat)
opt_results = collect_results(opt_model, opt_outputs.logits, opt_loss, inputs_flat)
self.assertTrue(same(correct_results, opt_results))
```

The `eager_model` removed the prefixes but the `opt_model` did not, so for parity, we needed to manually remove the prefixes for `opt_model`. With this PR, `eager_model` no longer removes the prefixes, so we can drop the manual removal for `opt_model` as well.
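Schematically (hypothetical parameter names, assuming wrapper FSDP with `use_orig_params=True` under Dynamo's `OptimizedModule`):

```python
# Hypothetical parameter names (not from the test suite):
eager_name = "_fsdp_wrapped_module.layer.weight"          # after this PR: prefix kept
opt_name = "_orig_mod._fsdp_wrapped_module.layer.weight"  # Dynamo prefix on top

# Only the Dynamo prefix needs stripping for parity now:
assert opt_name.removeprefix("_orig_mod.") == eager_name

# Before this PR, eager_name would have been "layer.weight", so the test also had to
# call clean_tensor_name() on opt_name to match.
```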

In this case, we should be safe to land this.

Contributor:

seems ok to me

@awgu (Collaborator, Author) commented Jan 17, 2023

@pytorchbot merge

@pytorchmergebot (Collaborator):
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@malfet (Contributor) commented Jan 17, 2023

@pytorchbot revert -m "Looks like it broke test_compatible_with_named_optimizer distributed tests, see https://hud.pytorch.org/pytorch/pytorch/commit/d6f3265e1add26abedb504910be93b393b9fb33c" -c nosignal

@pytorchmergebot (Collaborator):
@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator):
@awgu your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Jan 17, 2023
Revert "[FSDP] Do not clean FQNs even for use_orig_params=True (#91767)"

This reverts commit d6f3265.

Reverted #91767 on behalf of https://github.com/malfet due to Looks like it broke `test_compatible_with_named_optimizer` distributed tests, see https://hud.pytorch.org/pytorch/pytorch/commit/d6f3265e1add26abedb504910be93b393b9fb33c
@awgu (Collaborator, Author) commented Jan 17, 2023

We had a land race. We will re-land this PR after a fix on the optimizer state side.

pytorchmergebot pushed a commit that referenced this pull request Jan 30, 2023
The last PR (#91767) had a land race relating to `_NamedOptimizer` + FSDP and got reverted. This is a re-land.

Pull Request resolved: #92662
Approved by: https://github.com/rohan-varma
@facebook-github-bot facebook-github-bot deleted the gh/awgu/288/head branch June 8, 2023 15:33
