Add ReplicatedParameter and ReplicatedTensor to DDP. #74787
Conversation
As per the design in #72138, introduce ReplicatedParameter, replace DDP parameters with ReplicatedParameters, and tag DDP buffers as ReplicatedTensors. Differential Revision: [D35162813](https://our.internmc.facebook.com/intern/diff/D35162813/)
💊 CI failures summary (Dr. CI, as of commit b248f29): 8 new failures recognized by patterns; they do not appear to be due to upstream breakages.
```python
    return True

class ReplicatedParameter(ReplicatedTensor, torch.nn.Parameter):
```
@albanD I couldn't pass a ReplicatedTensor into torch.nn.Parameter and have isinstance() still recognize it as a ReplicatedTensor. As a result, I had to do something messy like this. Not sure if there is a better way to do it.
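For context, here is a minimal sketch (not the PR's actual implementation) of the kind of dual subclass this refers to. The ReplicatedTensor stand-in and the use of torch.Tensor._make_subclass (the same constructor nn.Parameter uses internally) are assumptions for illustration only:

```python
import torch

class ReplicatedTensor(torch.Tensor):
    # Stand-in marker subclass: the data is assumed to be identical on every rank.
    @staticmethod
    def __new__(cls, data):
        return torch.Tensor._make_subclass(cls, data, data.requires_grad)

class ReplicatedParameter(ReplicatedTensor, torch.nn.Parameter):
    # Inheriting from both keeps isinstance() true for ReplicatedTensor *and*
    # nn.Parameter; wrapping a ReplicatedTensor in nn.Parameter directly would
    # lose the ReplicatedTensor type.
    @staticmethod
    def __new__(cls, data, requires_grad=True):
        return torch.Tensor._make_subclass(cls, data, requires_grad)

p = ReplicatedParameter(torch.randn(3))
assert isinstance(p, torch.nn.Parameter) and isinstance(p, ReplicatedTensor)
```

The replies below point to #73459 as the cleaner alternative to this kind of workaround.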
Yes that's the plan from #73459
cc @jbschlosser
This is a bit delayed due to my medical leave, but wanted to leave an update here for posterity: #73459 is in now - please try it out if you'd like :)
mrshenli left a comment:
Is this a test or do we plan to land this soon? If it's the latter, shall we do some micro-benchmarks to make sure there is no perf regression?
```python
        )

        # After syncing, tag them as replicated.
        from torch.distributed._shard.replicated_tensor import ReplicatedTensor, ReplicatedParameter
```
Shouldn't we check if each param or buffer is replicated before converting them?
Similarly, shouldn't we record which ones we wrapped and use that when unwrapping, instead of relying on parameters_to_ignore.
Another thing, is using parameters_to_ignore to filter buffers just a naming artifact?
> Similarly, shouldn't we record which ones we wrapped and use that when unwrapping, instead of relying on parameters_to_ignore?

parameters_to_ignore is somewhat orthogonal to tagging something as replicated: that option controls which parameters DDP ignores completely. That is why we always need to rely on it and make sure we never tag anything in parameters_to_ignore as Replicated at all.

> Another thing, is using parameters_to_ignore to filter buffers just a naming artifact?

It's an unfortunate name, since it filters buffers as well in other places. I'll probably fix the name in a follow-up PR.
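Putting those two points together, the tagging pass skips both ignored entries and anything already wrapped, which also keeps it idempotent. A sketch reconstructed from the diff context shown later in this thread; the helper name _tag_replicated and the standalone-function form are illustrative, not the PR's exact code, and the import path is the one this PR adds:

```python
from torch.distributed._shard.replicated_tensor import ReplicatedParameter

def _tag_replicated(ddp):
    # Walk every submodule and wrap eligible parameters as ReplicatedParameter.
    for module_name, module in ddp.module.named_modules():
        for param_name, param in module.named_parameters(recurse=False):
            fqn = f"{module_name}.{param_name}"
            if (param.requires_grad
                    and fqn not in ddp.parameters_to_ignore          # never tag ignored params
                    and not isinstance(param, ReplicatedParameter)):  # keeps the pass idempotent
                # Re-registering under the same name replaces the plain
                # Parameter entry with the replicated wrapper.
                module.register_parameter(param_name, ReplicatedParameter(param))
```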
Oops, I added self.parameters_to_ignore a couple of years ago and agree that a better name is parameters_and_buffers_to_ignore.
@mrshenli The plan is to hopefully land this change; however, there are some tricky cases to take care of first. I'll make CI happy first (I've tagged the PR as WIP) and then run a benchmark as well to ensure there are no regressions.
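For reference, a minimal sketch of the kind of micro-benchmark being discussed, assuming a single-process gloo group is enough to compare per-step overhead before and after the change; the model size and iteration counts are arbitrary:

```python
import os
import time
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "distributed" setup; enough to measure DDP's per-step overhead.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(1024, 1024))
inp = torch.randn(64, 1024)

for _ in range(10):  # warm-up
    model(inp).sum().backward()

iters = 100
start = time.perf_counter()
for _ in range(iters):
    model(inp).sum().backward()
print(f"avg forward+backward: {(time.perf_counter() - start) / iters * 1e3:.3f} ms")

dist.destroy_process_group()
```

Running the same script on the base commit and on this PR would show whether the extra isinstance checks and parameter wrapping add measurable per-step cost.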
@pytorchbot retest this please
rohan-varma left a comment:
Are we considering enabling this change behind a flag such as DDP_USE_REPLICATED to phase the rollout? I would advocate for that unless we're very confident about the performance implications and the reliability of ReplicatedTensor, since this also has a large blast radius in OSS and we may not be fully aware of the impact until the 1.12 release.
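A minimal sketch of the kind of opt-in gate being suggested. DDP_USE_REPLICATED is the hypothetical flag name from this comment, and _replicated_tensor_enabled / _tag_params_and_buffers_as_replicated are illustrative names, not existing PyTorch APIs:

```python
import os

def _replicated_tensor_enabled() -> bool:
    # Phased rollout: only tag parameters/buffers when the (hypothetical)
    # environment variable is explicitly set by the user.
    return os.environ.get("DDP_USE_REPLICATED", "0") == "1"

# Inside DDP's __init__, the tagging step would then be guarded, e.g.:
#
#     if _replicated_tensor_enabled():
#         self._tag_params_and_buffers_as_replicated()  # illustrative helper name
```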
```python
        self.assertIsInstance(out, ReplicatedTensor)

        # Validate buffers not replicated after forward.
        self.assertNotIsInstance(ddp.module.foo, ReplicatedTensor)
```
Don't we synchronize the buffers, so shouldn't they be replicated?
```python
        params = list(model.parameters())
        self.assertEqual(2, len(params))
        self.assertEqual(model.weight, params[0])
        self.assertEqual(model.bias, params[1])
```
Should we add `self.assertTrue(all(isinstance(param, ReplicatedParameter) for param in model.parameters()))`?
```diff
         self.gradient_as_bucket_view = gradient_as_bucket_view
-        if hasattr(module, "_ddp_params_and_buffers_to_ignore"):
-            self.parameters_to_ignore = module._ddp_params_and_buffers_to_ignore
+        if hasattr(self.module, "_ddp_params_and_buffers_to_ignore"):
```
Curious why this change is needed?
```python
        # In debug mode, build a mapping of parameter index -> parameter.
        param_to_name_mapping = self._build_debug_param_to_name_mapping(parameters)
        # Sync params and buffers. Ensures all DDP models start off at the same value.
        self._sync_params_and_buffers(authoritative_rank=0)
```
Shall we only mark this after the sync completes successfully?
```python
        if not buffers_only:
            for param_name, param in module.named_parameters(recurse=False):
                if (param.requires_grad and f'{module_name}.{param_name}'
                        not in self.parameters_to_ignore and not isinstance(param, ReplicatedParameter)):
```
Does the latter check mean it is possible to call this function on already-replicated parameters? Should we try to ensure the application doesn't do that, since it would be unnecessary?
```python
            for param_name, param in module.named_parameters(recurse=False):
                if (param.requires_grad and f'{module_name}.{param_name}'
                        not in self.parameters_to_ignore and not isinstance(param, ReplicatedParameter)):
                    module.register_parameter(param_name, ReplicatedParameter(param))
```
Do we not de-register the original, non-replicated parameters?
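For reference (an illustration, not part of the PR): nn.Module.register_parameter under an existing name overwrites the entry in the module's parameter dict, so the plain Parameter is replaced rather than kept alongside. A small sketch with a plain nn.Parameter standing in for ReplicatedParameter:

```python
import torch
import torch.nn as nn

lin = nn.Linear(4, 4)
old_weight = lin.weight

# Re-registering under the same name overwrites lin._parameters["weight"];
# the original Parameter object is no longer registered on the module.
lin.register_parameter("weight", nn.Parameter(old_weight.detach().clone()))

assert lin.weight is not old_weight
assert len(list(lin.parameters())) == 2  # still just weight and bias
```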
Closing in favor of #75299.