Hybrid Sharded Data Parallel #89915
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/89915
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 Failures. As of commit 4e9638a, the following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Adds 2 new hybrid sharding strategies to FSDP:

1. `HYBRID_SHARD`: applies zero-3-style sharding within a node and data parallelism across nodes.
2. `HYBRID_SHARD_ZERO2`: applies zero-2-style sharding within a node and data parallelism across nodes.

These are useful for medium-sized models and aim to decrease communication volume. Tests and benchmarks will be run to understand which workloads are optimal under which sharding strategy.

Hybrid sharding in general works by sharding the model using a process group within a single node, and creating inter-node process groups for replication / data parallelism. The user either needs to pass in a tuple of these process groups, or `None`, and we generate the process groups appropriately.

**Acknowledgements**

- @awgu's excellent prototype: awgu@5ad3a16
- @liangluofb for ideation, feedback, and the initial implementation and experimentation
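As a usage sketch (not taken from the PR itself; the model and launch details are illustrative), enabling hybrid sharding with auto-generated process groups might look like this:

```python
# Minimal sketch: run under torchrun with NCCL; the Linear layer is a
# stand-in for a real medium-sized model.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
model = nn.Linear(1024, 1024).cuda()

# With no process groups supplied, FSDP derives an intra-node group for
# sharding and an inter-node group for gradient replication.
fsdp_model = FSDP(model, sharding_strategy=ShardingStrategy.HYBRID_SHARD)
```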
awgu left a comment:
I made an initial pass and left a lot of nitpicks. I will read the test code in a follow-up pass, possibly after you respond to some of the comments.
```python
# FSDP module directly
submodule._fsdp_use_orig_params = use_orig_params

# Initializes self.process_group, along with rank and world size. This will
```
nit: Personally, I do not like explaining what a function/method call does inline like this since this creates redundancy, which can go stale if only one place is updated. The developer should read the docstring for `_init_process_group_state`.
[Easy] I recommend changing before landing
I mostly want to emphasize the part a couple of lines later that mentions this is done before auto wrapping, and the logic for why, which I think is valuable.
Sounds good.
Just as a heads up, it looks like the test failures are real: I am not sure if we can make
```
nodes. This results in reduced communication volume as expensive all-gathers and
reduce-scatters are only done within a node, which can be more performant for medium
-sized models.
- ``_HYBRID_SHARD_ZERO2``: Apply ``SHARD_GRAD_OP`` within a node, and replicate parameters across
```
I guess we should omit this from the docstring for now.
Yes, I think that would be good to be safe.
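For context on the intra-node (sharding) and inter-node (replication) groups that docstring describes, here is a hedged sketch of how the two groups could pair up if built by hand, assuming 4 GPUs per node and an already-initialized default group; this uses plain `torch.distributed` calls, not the FSDP internals:

```python
# Assumes dist.init_process_group(...) has already run on every rank.
import torch.distributed as dist

gpus_per_node = 4  # illustrative assumption
rank = dist.get_rank()
world_size = dist.get_world_size()
num_nodes = world_size // gpus_per_node

# Intra-node groups: ranks on the same node shard parameters (zero-3 / zero-2
# style). Every rank must call new_group() for every group, hence full loops.
intra_node_groups = [
    dist.new_group(list(range(n * gpus_per_node, (n + 1) * gpus_per_node)))
    for n in range(num_nodes)
]
# Inter-node groups: the same local rank across nodes replicates gradients.
inter_node_groups = [
    dist.new_group(list(range(r, world_size, gpus_per_node)))
    for r in range(gpus_per_node)
]
shard_pg = intra_node_groups[rank // gpus_per_node]
replicate_pg = inter_node_groups[rank % gpus_per_node]
# These could then be passed to FSDP as process_group=(shard_pg, replicate_pg).
```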
awgu left a comment:
I have a final pass of nits, and we can land after that.
```python
)
else:
    state = _init_process_group_state_for_hybrid_shard(state, process_group)
    assert state.process_group is not None, "Expected to populate state.process_group for hybrid shard"
```
I think the asserts should be at the end of `_init_process_group_state_for_hybrid_shard` then, representing a post-condition. This is just my personal design preference and probably does not matter in this case. However, in general, you would have to enforce the post-condition upon each call to `_init_process_group_state_for_hybrid_shard`.
(This is also how I approach post-conditions and invariants in general -- they should be coupled to the method/function itself, not their usages.)
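A hedged sketch of that pattern follows; the helper name mirrors the PR's, but the body and state class are schematic, not the actual implementation:

```python
class _State:
    """Schematic stand-in for FSDP's internal state object."""
    def __init__(self):
        self.process_group = None

def _init_process_group_state_for_hybrid_shard(state, process_group):
    # The real helper would derive intra-/inter-node groups here.
    state.process_group = process_group
    # Post-condition lives at the end of the function itself, so every
    # call site inherits the guarantee without repeating the assert.
    assert state.process_group is not None, (
        "Expected to populate state.process_group for hybrid shard"
    )
    return state
```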
awgu left a comment:
Looks great to me! Awesome work, and thanks for fixing all of the nits. I am very excited to see the experiment results and downstream impact!
```python
# Owner(s): ["oncall: distributed"]

import contextlib
import functools
```
Suggested change:
```python
import functools
```
CI failures are related to autocast and are unrelated to this PR.
@pytorchbot merge -f "CI failures unrelated"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This PR broke ROCm CI periodic jobs, where the distributed tests get run.
Not sure how this PR only broke ROCm CI, unless this test is getting skipped on other platforms? The test is using the context managers from earlier in the file, so not sure why
This could be because of a Python versioning issue: parenthesized context managers were only added in Python 3.10, so the `with (...)` form used by the test (shown in the fix for #90580 below) was not permitted on older interpreters. This could be why the ROCm jobs are the ones failing.
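For reference, a sketch of two spellings that work on Python < 3.10, shown with `contextlib.nullcontext` stand-ins (the real test's managers are `patch_allreduce` / `patch_reduce_scatter`):

```python
import contextlib

# 1. Chain the managers on a single `with` statement (no parentheses):
with contextlib.nullcontext(), contextlib.nullcontext():
    pass

# 2. Or use contextlib.ExitStack when the list grows long:
with contextlib.ExitStack() as stack:
    stack.enter_context(contextlib.nullcontext())
    stack.enter_context(contextlib.nullcontext())
```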
Pull Request resolved: pytorch#89915. Approved by: https://github.com/awgu
Fixes PR #89915. The following syntax was not permitted until 3.10:

```python
with (
    patch_allreduce(patched_allreduce),
    patch_reduce_scatter(patched_reduce_scatter),
):
```

Pull Request resolved: #90580. Approved by: https://github.com/awgu