
Conversation

@mrshenli (Contributor) commented Jan 17, 2023

Stack from ghstack (oldest at bottom):

Differential Revision: D42554973

@pytorch-bot (bot) commented Jan 17, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/92334

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 Failures

As of commit 4303e30:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base 013afc5:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Jan 17, 2023
mrshenli added a commit that referenced this pull request Jan 17, 2023
ghstack-source-id: d2fa72c
Pull Request resolved: #92334
@mrshenli mrshenli added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 17, 2023
@mrshenli (Contributor, Author) commented:
@mrshenli has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@zhaojuanmao (Contributor) left a comment

Wondering whether we can switch to using tree_map at some point to make the output data type support more general? It is the same for FSDP as well.

@awgu (Collaborator) commented Jan 18, 2023

> Wondering whether we can switch to using tree_map at some point to make the output data type support more general? It is the same for FSDP as well.

tree_map() does not support dataclasses or PackedSequence (which FSDP's _apply_to_tensor() does). If we get tree_map() to cover a superset of those types, then we should be able to migrate.
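To make the gap concrete, here is a minimal sketch (not from this PR), assuming torch.utils._pytree's tree_map()/tree_flatten() as they behaved around this time: built-in containers are traversed, but an unregistered dataclass (and likewise PackedSequence) is treated as a single opaque leaf, so its inner tensors are never visited.

```python
# Sketch only (not part of this PR): why tree_map() cannot yet replace
# FSDP's _apply_to_tensor(). Assumes torch.utils._pytree circa early 2023.
from dataclasses import dataclass

import torch
from torch.utils._pytree import tree_flatten, tree_map


@dataclass
class Output:
    logits: torch.Tensor


# Built-in containers (dict/list/tuple) are traversed, so every tensor is visited.
nested = {"a": torch.ones(2), "b": [torch.zeros(3)]}
doubled = tree_map(lambda t: t * 2, nested)  # doubles both tensors

# An unregistered dataclass is a single leaf: the mapped function would receive
# the whole Output object rather than the tensor inside it.
leaves, _ = tree_flatten(Output(logits=torch.ones(2)))
print(leaves)  # [Output(logits=tensor([1., 1.]))] -- one opaque leaf
```

If dataclasses and PackedSequence were registered as pytree nodes (or otherwise handled by tree_map()), the traversal above would reach the inner tensors, which is the superset condition described above.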

mrshenli added a commit that referenced this pull request Jan 18, 2023
ghstack-source-id: 09f9a65
Pull Request resolved: #92334
@mrshenli (Contributor, Author) commented:
@mrshenli has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mrshenli (Contributor, Author) commented:
The test failure is unrelated to this PR:

======================================================================
ERROR [4.113s]: test_compatible_with_named_optimizer (__main__.TestFSDPOptimState)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 539, in wrapper
    self._join_processes(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 765, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 810, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 663, in run_test
    getattr(self, test_name)()
  File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 541, in wrapper
    fn()
  File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 172, in wrapper
    return func(*args, **kwargs)
  File "/var/lib/jenkins/workspace/test/distributed/fsdp/test_fsdp_optim_state.py", line 1479, in test_compatible_with_named_optimizer
    state_dicts.append(FSDP._optim_state_dict(model, optim))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1734, in _optim_state_dict
    return FullyShardedDataParallel.full_optim_state_dict(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1291, in full_optim_state_dict
    optim_state_dict=optim.state_dict(),
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/optim/named_optimizer.py", line 145, in state_dict
    return self._post_state_dict({"state": ret_state, "param_groups": ret_groups})
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/optim/named_optimizer.py", line 276, in _post_state_dict
    FSDP._optim_state_dict_post_hook(self.module, self._optimizer, state_dict)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1694, in _optim_state_dict_post_hook
    return FullyShardedDataParallel._optim_state_dict_impl(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1164, in _optim_state_dict_impl
    return _optim_state_dict(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1393, in _optim_state_dict
    fsdp_osd["param_groups"] = _unflatten_param_groups(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1240, in _unflatten_param_groups
    param_group_params = [
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1241, in <listcomp>
    param_key_to_param[flat_param_key]
KeyError: '_fsdp_wrapped_module.net1.0.bias'

@mrshenli (Contributor, Author) commented:
@pytorchbot merge -f "test failures are irrelevant"

@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@facebook-github-bot facebook-github-bot deleted the gh/mrshenli/361/head branch June 8, 2023 18:03
