
Conversation

@pritamdamania87 (Contributor) commented Mar 26, 2022

Stack from ghstack (oldest at bottom):

As per the design in #72138, introduce ReplicatedParameter, replace DDP parameters with ReplicatedParameters, and replace DDP buffers with ReplicatedTensors.

Differential Revision: [D35162813](https://our.internmc.facebook.com/intern/diff/D35162813/)

[ghstack-poisoned]
@facebook-github-bot (Contributor) commented Mar 26, 2022

💊 CI failures summary and remediations

As of commit b248f29 (more details on the Dr. CI page):


  • 10/10 failures introduced in this PR

🕵️ 8 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build pull / win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge) (1/8)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-03-30T23:07:14.8839229Z AssertionError: 11 unit test(s) failed:
2022-03-30T23:07:14.7083964Z 
2022-03-30T23:07:14.7084246Z OK (skipped=1)
2022-03-30T23:07:14.7084452Z 
2022-03-30T23:07:14.7084783Z Generating XML reports...
2022-03-30T23:07:14.7085600Z Generated XML report: test-reports\python-unittest\distributed.test_c10d_gloo\TEST-TimeoutTest-20220330230714.xml
2022-03-30T23:07:14.8836062Z Traceback (most recent call last):
2022-03-30T23:07:14.8836778Z   File "distributed/test_c10d_gloo.py", line 2336, in <module>
2022-03-30T23:07:14.8837225Z     run_tests()
2022-03-30T23:07:14.8837894Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_utils.py", line 633, in run_tests
2022-03-30T23:07:14.8838719Z     assert len(failed_tests) == 0, "{} unit test(s) failed:\n\t{}".format(
2022-03-30T23:07:14.8839229Z AssertionError: 11 unit test(s) failed:
2022-03-30T23:07:14.8839929Z 	DistributedDataParallelTest.test_ddp_comm_hook_future_passing_cpu
2022-03-30T23:07:14.8840812Z 	DistributedDataParallelTest.test_ddp_comm_hook_register_just_once
2022-03-30T23:07:14.8841705Z 	DistributedDataParallelTest.test_ddp_comm_hook_sparse_gradients
2022-03-30T23:07:14.8842580Z 	DistributedDataParallelTest.test_ddp_invalid_comm_hook_init
2022-03-30T23:07:14.8843438Z 	DistributedDataParallelTest.test_ddp_invalid_comm_hook_return_type
2022-03-30T23:07:14.8844321Z 	DistributedDataParallelTest.test_gloo_backend_cpu_module
2022-03-30T23:07:14.8845212Z 	DistributedDataParallelTest.test_gloo_backend_cpu_module_grad_is_view
2022-03-30T23:07:14.8846076Z 	DistributedDataParallelTest.test_ignored_output
2022-03-30T23:07:14.8846947Z 	DistributedDataParallelTest.test_ignored_output_with_unused_parameters
2022-03-30T23:07:14.8847837Z 	DistributedDataParallelTest.test_sparse_gradients

See GitHub Actions build pull / win-vs2019-cuda11.3-py3 / test (default, 2, 2, windows.8xlarge.nvidia.gpu) (2/8)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-03-31T01:35:42.3899479Z AssertionError: 3 unit test(s) failed:
2022-03-31T01:35:42.0146843Z 
2022-03-31T01:35:42.0147449Z OK (skipped=1)
2022-03-31T01:35:42.0147902Z 
2022-03-31T01:35:42.0148601Z Generating XML reports...
2022-03-31T01:35:42.0151300Z Generated XML report: test-reports\dist-gloo\distributed.test_c10d_spawn_gloo\TEST-TestDistributedNNFunctionsGloo-20220331013539.xml
2022-03-31T01:35:42.3895098Z Traceback (most recent call last):
2022-03-31T01:35:42.3896052Z   File "distributed/test_c10d_spawn_gloo.py", line 278, in <module>
2022-03-31T01:35:42.3896725Z     run_tests()
2022-03-31T01:35:42.3897632Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_utils.py", line 633, in run_tests
2022-03-31T01:35:42.3898820Z     assert len(failed_tests) == 0, "{} unit test(s) failed:\n\t{}".format(
2022-03-31T01:35:42.3899479Z AssertionError: 3 unit test(s) failed:
2022-03-31T01:35:42.3900538Z 	DistributedDataParallelSingleProcessTest.test_cpu
2022-03-31T01:35:42.3902021Z 	DistributedDataParallelSingleProcessTest.test_cuda
2022-03-31T01:35:42.3903550Z 	DistributedDataParallelSingleProcessTest.test_rnn
2022-03-31T01:35:42.7959457Z Traceback (most recent call last):
2022-03-31T01:35:42.7960380Z   File "run_test.py", line 1054, in <module>
2022-03-31T01:35:42.7960859Z     main()
2022-03-31T01:35:42.7961349Z   File "run_test.py", line 1032, in main
2022-03-31T01:35:42.7961986Z     raise RuntimeError(err_message)
2022-03-31T01:35:42.7962829Z RuntimeError: distributed/test_c10d_spawn_gloo failed!
2022-03-31T01:35:43.2335205Z 

See GitHub Actions build pull / win-vs2019-cuda11.3-py3 / test (force_on_cpu, 1, 1, windows.4xlarge) (3/8)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-03-30T23:09:47.5664762Z AssertionError: 11 unit test(s) failed:
2022-03-30T23:09:47.3928152Z 
2022-03-30T23:09:47.3928434Z OK (skipped=1)
2022-03-30T23:09:47.3928642Z 
2022-03-30T23:09:47.3928973Z Generating XML reports...
2022-03-30T23:09:47.3929798Z Generated XML report: test-reports\python-unittest\distributed.test_c10d_gloo\TEST-TimeoutTest-20220330230946.xml
2022-03-30T23:09:47.5661677Z Traceback (most recent call last):
2022-03-30T23:09:47.5662357Z   File "distributed/test_c10d_gloo.py", line 2336, in <module>
2022-03-30T23:09:47.5662799Z     run_tests()
2022-03-30T23:09:47.5663498Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_utils.py", line 633, in run_tests
2022-03-30T23:09:47.5664268Z     assert len(failed_tests) == 0, "{} unit test(s) failed:\n\t{}".format(
2022-03-30T23:09:47.5664762Z AssertionError: 11 unit test(s) failed:
2022-03-30T23:09:47.5665445Z 	DistributedDataParallelTest.test_ddp_comm_hook_future_passing_cpu
2022-03-30T23:09:47.5666341Z 	DistributedDataParallelTest.test_ddp_comm_hook_register_just_once
2022-03-30T23:09:47.5667226Z 	DistributedDataParallelTest.test_ddp_comm_hook_sparse_gradients
2022-03-30T23:09:47.5668100Z 	DistributedDataParallelTest.test_ddp_invalid_comm_hook_init
2022-03-30T23:09:47.5668969Z 	DistributedDataParallelTest.test_ddp_invalid_comm_hook_return_type
2022-03-30T23:09:47.5669840Z 	DistributedDataParallelTest.test_gloo_backend_cpu_module
2022-03-30T23:09:47.5670718Z 	DistributedDataParallelTest.test_gloo_backend_cpu_module_grad_is_view
2022-03-30T23:09:47.5671742Z 	DistributedDataParallelTest.test_ignored_output
2022-03-30T23:09:47.5672618Z 	DistributedDataParallelTest.test_ignored_output_with_unused_parameters
2022-03-30T23:09:47.5673505Z 	DistributedDataParallelTest.test_sparse_gradients

See GitHub Actions build pull / linux-xenial-py3.7-gcc5.4 / test (distributed, 1, 1, linux.2xlarge) (4/8)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-03-30T22:50:36.3781424Z AssertionError: 4 unit test(s) failed:
2022-03-30T22:50:36.2070504Z 
2022-03-30T22:50:36.2070565Z OK
2022-03-30T22:50:36.2070656Z 
2022-03-30T22:50:36.2070750Z Generating XML reports...
2022-03-30T22:50:36.2115906Z Generated XML report: test-reports/python-unittest/distributed.rpc.test_tensorpipe_agent/TEST-TensorPipeThreeWorkersRemoteModuleTest-20220330225034.xml
2022-03-30T22:50:36.3771317Z Traceback (most recent call last):
2022-03-30T22:50:36.3779611Z   File "distributed/rpc/test_tensorpipe_agent.py", line 38, in <module>
2022-03-30T22:50:36.3780367Z     run_tests()
2022-03-30T22:50:36.3780824Z   File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 634, in run_tests
2022-03-30T22:50:36.3781189Z     len(failed_tests), '\n\t'.join(failed_tests))
2022-03-30T22:50:36.3781424Z AssertionError: 4 unit test(s) failed:
2022-03-30T22:50:36.3781695Z 	TensorPipeDdpComparisonTest.test_ddp_dist_autograd_sparse_grads
2022-03-30T22:50:36.3782038Z 	TensorPipeDdpUnderDistAutogradTest.test_backward_ddp_inside
2022-03-30T22:50:36.3782379Z 	TensorPipeDdpUnderDistAutogradTest.test_backward_ddp_outside
2022-03-30T22:50:36.3782725Z 	TensorPipeDdpUnderDistAutogradTest.test_backward_ddp_outside_uneven_inputs
2022-03-30T22:50:36.4916884Z Traceback (most recent call last):
2022-03-30T22:50:36.4917158Z   File "test/run_test.py", line 1054, in <module>
2022-03-30T22:50:36.4919551Z     main()
2022-03-30T22:50:36.4919808Z   File "test/run_test.py", line 1032, in main
2022-03-30T22:50:36.4921674Z     raise RuntimeError(err_message)
2022-03-30T22:50:36.4921953Z RuntimeError: distributed/rpc/test_tensorpipe_agent failed!

See GitHub Actions build pull / linux-xenial-cuda11.3-py3.7-gcc7 / build (5/8)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

2022-03-30T22:07:33.0684272Z  echo "ERR...t available for the merge-base of your branch"
2022-03-30T22:07:33.0681347Z fi
2022-03-30T22:07:33.0681574Z # Covers the case where a previous tag doesn't exist for the tree
2022-03-30T22:07:33.0681903Z # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly
2022-03-30T22:07:33.0682225Z if ! git rev-parse "$MERGE_BASE:.circleci/docker"; then
2022-03-30T22:07:33.0682567Z   echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit"
2022-03-30T22:07:33.0682927Z   exit 1
2022-03-30T22:07:33.0683081Z fi
2022-03-30T22:07:33.0683320Z PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker")
2022-03-30T22:07:33.0683653Z # If no image exists but the hash is the same as the previous hash then we should error out here
2022-03-30T22:07:33.0683945Z if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then
2022-03-30T22:07:33.0684272Z   echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch"
2022-03-30T22:07:33.0684617Z   echo "       contact the PyTorch team to restore the original images"
2022-03-30T22:07:33.0684909Z   exit 1
2022-03-30T22:07:33.0685062Z fi
2022-03-30T22:07:33.0685256Z echo ::set-output name=rebuild::yes
2022-03-30T22:07:33.0695558Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2022-03-30T22:07:33.0695765Z env:
2022-03-30T22:07:33.0695926Z   IN_CI: 1
2022-03-30T22:07:33.0696086Z   IS_GHA: 1
2022-03-30T22:07:33.0696288Z   BASE_REVISION: 7eade146193d4a87958763fbc83a2d40c07faf8e
2022-03-30T22:07:33.0696704Z   DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7:a2c09c6009bb8a10cbb45a8c5f7c573593289be0

See GitHub Actions build pull / linux-xenial-cuda11.3-py3.7-gcc7-bazel-test / build-and-test (6/8)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

2022-03-30T22:07:29.9628930Z  echo "ERR...t available for the merge-base of your branch"
2022-03-30T22:07:29.9625934Z fi
2022-03-30T22:07:29.9626160Z # Covers the case where a previous tag doesn't exist for the tree
2022-03-30T22:07:29.9626498Z # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly
2022-03-30T22:07:29.9626809Z if ! git rev-parse "$MERGE_BASE:.circleci/docker"; then
2022-03-30T22:07:29.9627221Z   echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit"
2022-03-30T22:07:29.9627520Z   exit 1
2022-03-30T22:07:29.9627681Z fi
2022-03-30T22:07:29.9627901Z PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker")
2022-03-30T22:07:29.9628229Z # If no image exists but the hash is the same as the previous hash then we should error out here
2022-03-30T22:07:29.9628599Z if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then
2022-03-30T22:07:29.9628930Z   echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch"
2022-03-30T22:07:29.9629329Z   echo "       contact the PyTorch team to restore the original images"
2022-03-30T22:07:29.9629562Z   exit 1
2022-03-30T22:07:29.9629728Z fi
2022-03-30T22:07:29.9629912Z echo ::set-output name=rebuild::yes
2022-03-30T22:07:29.9639851Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2022-03-30T22:07:29.9640074Z env:
2022-03-30T22:07:29.9640219Z   IN_CI: 1
2022-03-30T22:07:29.9640379Z   IS_GHA: 1
2022-03-30T22:07:29.9640558Z   GIT_DEFAULT_BRANCH: master
2022-03-30T22:07:29.9640779Z   BASE_REVISION: 7eade146193d4a87958763fbc83a2d40c07faf8e

See GitHub Actions build pull / win-vs2019-cuda11.3-py3 / test (default, 1, 2, windows.8xlarge.nvidia.gpu) (7/8)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-03-31T00:55:53.0676222Z AssertionError: 11 unit test(s) failed:
2022-03-31T00:55:52.7373238Z 
2022-03-31T00:55:52.7373871Z OK (skipped=1)
2022-03-31T00:55:52.7374359Z 
2022-03-31T00:55:52.7375131Z Generating XML reports...
2022-03-31T00:55:52.7377412Z Generated XML report: test-reports\python-unittest\distributed.test_c10d_gloo\TEST-TimeoutTest-20220331005551.xml
2022-03-31T00:55:53.0671879Z Traceback (most recent call last):
2022-03-31T00:55:53.0672815Z   File "distributed/test_c10d_gloo.py", line 2336, in <module>
2022-03-31T00:55:53.0673544Z     run_tests()
2022-03-31T00:55:53.0674557Z   File "C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_utils.py", line 633, in run_tests
2022-03-31T00:55:53.0675564Z     assert len(failed_tests) == 0, "{} unit test(s) failed:\n\t{}".format(
2022-03-31T00:55:53.0676222Z AssertionError: 11 unit test(s) failed:
2022-03-31T00:55:53.0677367Z 	DistributedDataParallelTest.test_ddp_comm_hook_future_passing_cpu
2022-03-31T00:55:53.0678554Z 	DistributedDataParallelTest.test_ddp_comm_hook_register_just_once
2022-03-31T00:55:53.0679738Z 	DistributedDataParallelTest.test_ddp_comm_hook_sparse_gradients
2022-03-31T00:55:53.0681023Z 	DistributedDataParallelTest.test_ddp_invalid_comm_hook_init
2022-03-31T00:55:53.0682209Z 	DistributedDataParallelTest.test_ddp_invalid_comm_hook_return_type
2022-03-31T00:55:53.0683356Z 	DistributedDataParallelTest.test_gloo_backend_cpu_module
2022-03-31T00:55:53.0684517Z 	DistributedDataParallelTest.test_gloo_backend_cpu_module_grad_is_view
2022-03-31T00:55:53.0685629Z 	DistributedDataParallelTest.test_ignored_output
2022-03-31T00:55:53.0686794Z 	DistributedDataParallelTest.test_ignored_output_with_unused_parameters
2022-03-31T00:55:53.0687968Z 	DistributedDataParallelTest.test_sparse_gradients

See GitHub Actions build pull / deploy-linux-xenial-cuda11.3-py3.7-gcc7 / build (8/8)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

2022-03-30T22:07:31.7792333Z  echo "ERR...t available for the merge-base of your branch"
2022-03-30T22:07:31.7789410Z fi
2022-03-30T22:07:31.7789638Z # Covers the case where a previous tag doesn't exist for the tree
2022-03-30T22:07:31.7789972Z # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly
2022-03-30T22:07:31.7790285Z if ! git rev-parse "$MERGE_BASE:.circleci/docker"; then
2022-03-30T22:07:31.7790629Z   echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit"
2022-03-30T22:07:31.7790911Z   exit 1
2022-03-30T22:07:31.7791146Z fi
2022-03-30T22:07:31.7791370Z PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker")
2022-03-30T22:07:31.7791701Z # If no image exists but the hash is the same as the previous hash then we should error out here
2022-03-30T22:07:31.7792006Z if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then
2022-03-30T22:07:31.7792333Z   echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch"
2022-03-30T22:07:31.7792668Z   echo "       contact the PyTorch team to restore the original images"
2022-03-30T22:07:31.7792960Z   exit 1
2022-03-30T22:07:31.7793127Z fi
2022-03-30T22:07:31.7793311Z echo ::set-output name=rebuild::yes
2022-03-30T22:07:31.7803412Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2022-03-30T22:07:31.7803631Z env:
2022-03-30T22:07:31.7803777Z   IN_CI: 1
2022-03-30T22:07:31.7803941Z   IS_GHA: 1
2022-03-30T22:07:31.7804159Z   BASE_REVISION: 7eade146193d4a87958763fbc83a2d40c07faf8e
2022-03-30T22:07:31.7804562Z   DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7:a2c09c6009bb8a10cbb45a8c5f7c573593289be0

2 failures not recognized by patterns:

GitHub Actions pull / pytorch-xla-linux-bionic-py3.7-clang8 / test (xla, 1, 1, linux.2xlarge) — step "Test"
GitHub Actions Lint / flake8-py3 — step "Fail if there were any warnings"

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@facebook-github-bot added the oncall: distributed label on Mar 26, 2022
pritamdamania87 pushed a commit that referenced this pull request Mar 26, 2022
As per the design in #72138,
introduce ReplicatedParameter and replace DDP parameters with
ReplicatedParameters and DDP buffers as ReplicatedTensors.

Differential Revision: [D35162813](https://our.internmc.facebook.com/intern/diff/D35162813/)

ghstack-source-id: 152286552
Pull Request resolved: #74787

return True

class ReplicatedParameter(ReplicatedTensor, torch.nn.Parameter):
pritamdamania87 (Contributor, Author) commented:

@albanD I couldn't pass a ReplicatedTensor to torch.nn.Parameter and still have isinstance() recognize it as a ReplicatedTensor. As a result, I had to do something messy like this. Not sure if there is a better way to do it.
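For reference, a minimal sketch of the issue and the workaround. This is illustrative only: a stand-in subclass replaces the real ReplicatedTensor, and constructing through `Tensor._make_subclass` mirrors what `nn.Parameter.__new__` does internally rather than the exact code in this diff.

```python
import torch

# Stand-in for ReplicatedTensor, just to illustrate the isinstance() problem
# (the real class lives in torch.distributed._shard.replicated_tensor).
class MyReplicatedTensor(torch.Tensor):
    pass

rt = torch.zeros(2, 2).as_subclass(MyReplicatedTensor)

# At the time of this PR, torch.nn.Parameter(rt) rebuilt the tensor as a plain
# Parameter, so isinstance(Parameter(rt), MyReplicatedTensor) came back False.
# (#73459 later changed Parameter to preserve custom tensor subclasses.)

# The "messy" workaround: a class inheriting from both, re-created via
# Tensor._make_subclass so both isinstance() checks hold.
class MyReplicatedParameter(MyReplicatedTensor, torch.nn.Parameter):
    def __new__(cls, data, requires_grad=True):
        return torch.Tensor._make_subclass(cls, data, requires_grad)

p = MyReplicatedParameter(rt)
assert isinstance(p, torch.nn.Parameter)
assert isinstance(p, MyReplicatedTensor)
```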

A collaborator commented:

Yes that's the plan from #73459

A contributor commented:

This is a bit delayed due to my medical leave, but I wanted to leave an update here for posterity: #73459 is in now - please try it out if you'd like :)

mrshenli (Contributor) left a comment:

Is this a test or do we plan to land this soon? If it's the latter, shall we do some micro-benchmarks to make sure there is no perf regression?
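For what it's worth, a rough sketch of the kind of micro-benchmark being suggested here. The single-process gloo setup, model, sizes, and iteration count are arbitrary placeholders, not an established benchmark; a real comparison would run the same loop with and without the ReplicatedParameter change.

```python
import os
import time

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def benchmark_ddp(iters: int = 200) -> None:
    # Single-process "world" so the script is self-contained.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = DDP(torch.nn.Linear(1024, 1024))
    inp = torch.randn(32, 1024)

    # Warm-up so bucket allocation doesn't skew the measurement.
    for _ in range(10):
        model(inp).sum().backward()

    start = time.perf_counter()
    for _ in range(iters):
        model(inp).sum().backward()
    elapsed = time.perf_counter() - start
    print(f"{elapsed / iters * 1000:.3f} ms per forward+backward")

    dist.destroy_process_group()


if __name__ == "__main__":
    benchmark_ddp()
```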

)

# After syncing, tag them as replicated.
from torch.distributed._shard.replicated_tensor import ReplicatedTensor, ReplicatedParameter
A contributor commented:

Shouldn't we check if each param or buffer is replicated before converting them?
Similarly, shouldn't we record which ones we wrapped and use that when unwrapping, instead of relying on parameters_to_ignore?

Another thing, is using parameters_to_ignore to filter buffers just a naming artifact?

pritamdamania87 (Contributor, Author) commented:

Similarly, shouldn't we record which ones we wrapped and use that when unwrapping, instead of relying on parameters_to_ignore.

So parameters_to_ignore is slightly orthogonal to tagging something as replicated or not since that option is actually controlling which parameters DDP ignores completely. That is why we always need to rely on it and ensure we don't tag parameters_to_ignore as Replicated at all.

Another thing, is using parameters_to_ignore to filter buffers just a naming artifact?

It's an unfortunate name since it is filtering buffers as well in other places. I'll probably fix this name in a follow up PR.
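For context, a rough sketch of the tagging loop under discussion, with both filters applied. This is illustrative only: ReplicatedParameter is the class this PR introduces (the import only works with this PR's changes applied), and the constructor signatures here are assumptions.

```python
import torch
from torch.distributed._shard.replicated_tensor import ReplicatedTensor, ReplicatedParameter


def _tag_as_replicated(model: torch.nn.Module, parameters_to_ignore) -> None:
    """Wrap DDP-managed params/buffers, skipping ignored and already-wrapped ones."""
    for module_name, module in model.named_modules():
        for param_name, param in module.named_parameters(recurse=False):
            fqn = f"{module_name}.{param_name}"
            if (param.requires_grad
                    and fqn not in parameters_to_ignore  # DDP ignores these entirely
                    and not isinstance(param, ReplicatedParameter)):  # don't re-wrap
                module.register_parameter(param_name, ReplicatedParameter(param))
        for buffer_name, buffer in module.named_buffers(recurse=False):
            fqn = f"{module_name}.{buffer_name}"
            # Despite its name, parameters_to_ignore also filters buffers.
            if fqn not in parameters_to_ignore and not isinstance(buffer, ReplicatedTensor):
                setattr(module, buffer_name, ReplicatedTensor(buffer))
```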

A contributor commented:

Oops, I added self.parameters_to_ignore a couple of years ago and agree that a better name is parameters_and_buffers_to_ignore.

As per the design in #72138,
introduce ReplicatedParameter and replace DDP parameters with
ReplicatedParameters and DDP buffers as ReplicatedTensors.

Differential Revision: [D35162813](https://our.internmc.facebook.com/intern/diff/D35162813/)

[ghstack-poisoned]
@pritamdamania87 pritamdamania87 changed the title Add ReplicatedParameter and ReplicatedTensor to DDP. [WIP] Add ReplicatedParameter and ReplicatedTensor to DDP. Mar 30, 2022
@pritamdamania87 pritamdamania87 changed the title [WIP] Add ReplicatedParameter and ReplicatedTensor to DDP. Add ReplicatedParameter and ReplicatedTensor to DDP. Mar 30, 2022
pritamdamania87 (Contributor, Author) commented:

@mrshenli The plan is to eventually land this change; however, there are some tricky cases to take care of first. I'll make CI happy first (I've tagged the PR as WIP) and then run a benchmark to make sure there are no regressions.

As per the design in #72138,
introduce ReplicatedParameter and replace DDP parameters with
ReplicatedParameters and DDP buffers as ReplicatedTensors.

Differential Revision: [D35162813](https://our.internmc.facebook.com/intern/diff/D35162813/)

[ghstack-poisoned]
pritamdamania87 pushed a commit that referenced this pull request Mar 30, 2022
Pull Request resolved: #74787

As per the design in #72138,
introduce ReplicatedParameter and replace DDP parameters with
ReplicatedParameters and DDP buffers as ReplicatedTensors.
ghstack-source-id: 152544758

Differential Revision: [D35162813](https://our.internmc.facebook.com/intern/diff/D35162813/)
pritamdamania87 (Contributor, Author) commented:

@pytorchbot retest this please

As per the design in #72138,
introduce ReplicatedParameter and replace DDP parameters with
ReplicatedParameters and DDP buffers as ReplicatedTensors.

Differential Revision: [D35162813](https://our.internmc.facebook.com/intern/diff/D35162813/)

[ghstack-poisoned]
@pritamdamania87 pritamdamania87 requested a review from awgu as a code owner March 30, 2022 21:56
As per the design in #72138,
introduce ReplicatedParameter and replace DDP parameters with
ReplicatedParameters and DDP buffers as ReplicatedTensors.

Differential Revision: [D35162813](https://our.internmc.facebook.com/intern/diff/D35162813/)

[ghstack-poisoned]
pritamdamania87 pushed a commit that referenced this pull request Mar 30, 2022
Pull Request resolved: #74787

As per the design in #72138,
introduce ReplicatedParameter and replace DDP parameters with
ReplicatedParameters and DDP buffers as ReplicatedTensors.
ghstack-source-id: 152632043

Differential Revision: [D35162813](https://our.internmc.facebook.com/intern/diff/D35162813/)
rohan-varma (Contributor) left a comment:

Are we considering enabling this change with a flag such as DDP_USE_REPLICATED to phase the rollout? I would advocate for this unless we're very confident about performance implications and reliability of ReplicatedTensor, as it would also have a large blast radius in OSS, and we may not be completely aware of it until the 1.12 release happens.
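To illustrate, a minimal sketch of that kind of gate. The flag name DDP_USE_REPLICATED is just the one suggested above; neither the variable nor this helper exists in PyTorch.

```python
import os


def _use_replicated_tensors() -> bool:
    # Hypothetical opt-in switch for phasing the rollout; off by default.
    return os.environ.get("DDP_USE_REPLICATED", "0") == "1"


# Sketch of how DistributedDataParallel.__init__ might consult it:
#     if _use_replicated_tensors():
#         self._tag_params_and_buffers_as_replicated()
```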

self.assertIsInstance(out, ReplicatedTensor)

# Validate buffers not replicated after forward.
self.assertNotIsInstance(ddp.module.foo, ReplicatedTensor)
A contributor commented:

Don't we synchronize the buffers, so shouldn't they be replicated?

params = list(model.parameters())
self.assertEqual(2, len(params))
self.assertEqual(model.weight, params[0])
self.assertEqual(model.bias, params[1])
A contributor commented:

should we add self.assertTrue(all(isinstance(param, ReplicatedParameter) for param in model.parameters()))?

self.gradient_as_bucket_view = gradient_as_bucket_view
if hasattr(module, "_ddp_params_and_buffers_to_ignore"):
self.parameters_to_ignore = module._ddp_params_and_buffers_to_ignore
if hasattr(self.module, "_ddp_params_and_buffers_to_ignore"):
A contributor commented:

Curious why this change is needed?

# In debug mode, build a mapping of parameter index -> parameter.
param_to_name_mapping = self._build_debug_param_to_name_mapping(parameters)
# Sync params and buffers. Ensures all DDP models start off at the same value.
self._sync_params_and_buffers(authoritative_rank=0)
A contributor commented:

Shall we only mark this after the sync completes successfully?

if not buffers_only:
for param_name, param in module.named_parameters(recurse=False):
if (param.requires_grad and f'{module_name}.{param_name}'
not in self.parameters_to_ignore and not isinstance(param, ReplicatedParameter)):
A contributor commented:

Does the latter check mean it is possible to call this function on already-replicated parameters? Should we try to ensure the application doesn't do this, since it would be unnecessary?


for param_name, param in module.named_parameters(recurse=False):
if (param.requires_grad and f'{module_name}.{param_name}'
not in self.parameters_to_ignore and not isinstance(param, ReplicatedParameter)):
module.register_parameter(param_name, ReplicatedParameter(param))
A contributor commented:

Do we not de-register the original, non-replicated parameters?
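For reference, nn.Module.register_parameter() with an existing name overwrites that entry in the module's parameter dict, so the original Parameter is effectively de-registered by the call itself. A small self-contained check (using a plain Linear, unrelated to this diff):

```python
import torch

# Re-registering under the same name replaces the entry in module._parameters,
# so the old Parameter drops out of the module's parameter list without a
# separate de-registration step.
lin = torch.nn.Linear(4, 4)
old_weight = lin.weight

lin.register_parameter("weight", torch.nn.Parameter(old_weight.detach().clone()))

assert lin.weight is not old_weight
assert sum(1 for _ in lin.parameters()) == 2  # still just weight and bias
```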

pritamdamania87 (Contributor, Author) commented:

Closing in favor of: #75299

@facebook-github-bot facebook-github-bot deleted the gh/pritamdamania87/225/head branch May 6, 2022 14:17