
Conversation

@mrshenli (Contributor) commented Jan 17, 2023

Stack from ghstack (oldest at bottom):

Differential Revision: D42554973

@pytorch-bot (bot) commented Jan 17, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/92334

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 Failures

As of commit 4303e30:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base 013afc5:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Jan 17, 2023
mrshenli added a commit that referenced this pull request Jan 17, 2023
ghstack-source-id: d2fa72c
Pull Request resolved: #92334
@mrshenli mrshenli added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 17, 2023
@mrshenli (Contributor, Author) commented:
@mrshenli has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@zhaojuanmao (Contributor) left a comment

Wondering whether we can switch to using tree_map at some point to make the output data type support more general? It is the same for FSDP as well.

@awgu (Collaborator) commented Jan 18, 2023

> Wondering whether we can switch to using tree_map at some point to make the output data type support more general? It is the same for FSDP as well.

tree_map() does not support dataclasses or PackedSequence (which FSDP's _apply_to_tensor() does). If we get tree_map() to cover a superset of those types, then we should be able to migrate.
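To make the gap concrete, here is a minimal sketch (not from this PR), assuming torch.utils._pytree's tree_map()/tree_flatten() as they behaved around this time: built-in containers are traversed, but an unregistered dataclass (and likewise PackedSequence) is treated as a single opaque leaf, so its inner tensors are never visited.

```python
# Sketch only (not part of this PR): why tree_map() cannot yet replace
# FSDP's _apply_to_tensor(). Assumes torch.utils._pytree circa early 2023.
from dataclasses import dataclass

import torch
from torch.utils._pytree import tree_flatten, tree_map


@dataclass
class Output:
    logits: torch.Tensor


# Built-in containers (dict/list/tuple) are traversed, so every tensor is visited.
nested = {"a": torch.ones(2), "b": [torch.zeros(3)]}
doubled = tree_map(lambda t: t * 2, nested)  # doubles both tensors

# An unregistered dataclass is a single leaf: the mapped function would receive
# the whole Output object rather than the tensor inside it.
leaves, _ = tree_flatten(Output(logits=torch.ones(2)))
print(leaves)  # [Output(logits=tensor([1., 1.]))] -- one opaque leaf
```

If dataclasses and PackedSequence were registered as pytree nodes (or otherwise handled by tree_map()), the traversal above would reach the inner tensors, which is the superset condition described above.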

mrshenli added a commit that referenced this pull request Jan 18, 2023
ghstack-source-id: 09f9a65
Pull Request resolved: #92334
@mrshenli (Contributor, Author) commented:
@mrshenli has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mrshenli (Contributor, Author) commented:
The test failure is unrelated to this PR:

======================================================================
ERROR [4.113s]: test_compatible_with_named_optimizer (__main__.TestFSDPOptimState)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 539, in wrapper
    self._join_processes(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 765, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 810, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 663, in run_test
    getattr(self, test_name)()
  File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 541, in wrapper
    fn()
  File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 172, in wrapper
    return func(*args, **kwargs)
  File "/var/lib/jenkins/workspace/test/distributed/fsdp/test_fsdp_optim_state.py", line 1479, in test_compatible_with_named_optimizer
    state_dicts.append(FSDP._optim_state_dict(model, optim))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1734, in _optim_state_dict
    return FullyShardedDataParallel.full_optim_state_dict(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1291, in full_optim_state_dict
    optim_state_dict=optim.state_dict(),
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/optim/named_optimizer.py", line 145, in state_dict
    return self._post_state_dict({"state": ret_state, "param_groups": ret_groups})
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/optim/named_optimizer.py", line 276, in _post_state_dict
    FSDP._optim_state_dict_post_hook(self.module, self._optimizer, state_dict)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1694, in _optim_state_dict_post_hook
    return FullyShardedDataParallel._optim_state_dict_impl(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1164, in _optim_state_dict_impl
    return _optim_state_dict(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1393, in _optim_state_dict
    fsdp_osd["param_groups"] = _unflatten_param_groups(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1240, in _unflatten_param_groups
    param_group_params = [
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1241, in <listcomp>
    param_key_to_param[flat_param_key]
KeyError: '_fsdp_wrapped_module.net1.0.bias'

@mrshenli (Contributor, Author) commented:
@pytorchbot merge -f "test failures are irrelevant"

@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@facebook-github-bot facebook-github-bot deleted the gh/mrshenli/361/head branch June 8, 2023 18:03
