[ZeRO][BE] Clean up ZeRO tests #73842

awgu · 2022-03-06T23:22:22Z

Stack from ghstack:

[ZeRO][BE] Clean up ZeRO tests #73842 [ZeRO][BE] Clean up ZeRO tests

Overview
This cleans up the ZeroRedundancyOptimizer tests. I apologize for strong formatting changes mixed in with actually-beneficial changes. It was convenient to unify the formatting while doing a deep comb through the full test file.

The main non-formatting changes include:

Using @parametrize instead of manually including for loops over possible argument values
Removing the DEVICE global variable, which was used only for the TestZeroRedundancyOptimizerSingleRank tests, in favor of consistent usage of self.device in both TestZeroRedundancyOptimizerSingleRank and TestZeroRedundancyOptimizerDistributed
Moving assert ... == ... to self.assertEqual(..., ...) when the assert is part of the test's correctness
Removing the if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2): conditional guards in favor of @common_distributed.skip_if_no_gpu for TestZeroRedundancyOptimizerDistributed
- For TestZeroRedundancyOptimizerDistributed, self.device is torch.device(self.rank) if CUDA is available, while self.world_size is at least 2, even if torch.cuda.device_count() == 1.
- The problematic case is exactly when torch.cuda.device_count() == 1 but self.world_size == 2 since then calling self.device on rank 1 will error. The existing conditional guard prevented this case for some tests, but it was not used consistently (e.g. test_multiple_groups()), which is most likely the reason for the hangs and resulting test flakiness. (From my experience landing the recent ZeRO constructor changes, the Windows environment uses a world size of 2 but only has 1 device available.)
- A more robust solution is to always use the skip_if_no_gpu decorator as long as the test uses self.device and CUDA is available. This is in line with the recommended SPSD usage of ZeRO.

Renaming test_multiple_groups() to test_nondefault_process_group()

The existing test_multiple_groups() was slightly misnamed. Also, it is only nontrivial for a world size of (at least) 4 since it tests using a process group including only even ranks. It was marked as flaky on Windows, and I believe this is because of the world size and torch.cuda.device_count() mismatch. Now, the test only uses GPU if there are enough available and falls back to CPU otherwise, which is safe since the test uses Gloo backend.

There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target fitting into the broadcast bucket and not fitting into the broadcast bucket:

pytorch/test/distributed/optim/test_zero_redundancy_optimizer.py

Lines 658 to 684 in 1d49711

    
           # Model fitting in the broadcast bucket 
        
           model = torch.nn.Sequential( 
        
               torch.nn.Linear(input_width, hidden), 
        
               torch.nn.Linear(hidden, target_width), 
        
           ).to(self.device) 
        
           # With SGD, Momentum is required to get a state to shard 
        
           optimizer = ZeroRedundancyOptimizer( 
        
               model.parameters(), optimizer_class=SGD, lr=0.1, momentum=0.99, process_group=process_group 
        
           ) 
        
           check(optimizer) 
        
           # Model not-fitting in the broadcast bucket 
        
           model = torch.nn.Sequential( 
        
               torch.nn.Linear(input_width, hidden), 
        
               torch.nn.Linear(hidden, target_width), 
        
           ).to(self.device) 
        
           # With SGD, Momentum is required to get a state to shard 
        
           optimizer = ZeroRedundancyOptimizer( 
        
               model.parameters(), 
        
               optimizer_class=SGD, 
        
               lr=0.1, 
        
               momentum=0.99, 
        
               process_group=process_group, 
        
           ) 
        
           check(optimizer)

Changing _test_zero_model_parallel() to not use CPU
- This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU.

Questions

How might we limit the runs for test_ddp_zero_overlap()? Because it parameterizes over many values, it contributes significantly to the time-to-signal. However, it is an experimental feature, so it is not critical that the tests run every time.

Differential Revision: D34675709

[ghstack-poisoned]

pytorch-bot · 2022-03-06T23:22:26Z

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/a5368870a49ad03a7104b8312738f980f210e4db/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

Workflows	Labels (bold enabled)	Status
Triggered Workflows
linux-binary-conda	`ciflow/binaries`, `ciflow/binaries_conda`, `ciflow/default`	✅ triggered
linux-binary-libtorch-cxx11-abi	`ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`	✅ triggered
linux-binary-libtorch-pre-cxx11	`ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`	✅ triggered
linux-binary-manywheel	`ciflow/binaries`, `ciflow/binaries_wheel`, `ciflow/default`	✅ triggered
linux-bionic-py3.7-clang9	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/noarch`, `ciflow/trunk`	✅ triggered
linux-bionic-rocm4.5-py3.7	`ciflow/all`, `ciflow/default`, `ciflow/linux`, `ciflow/rocm`, `ciflow/trunk`	✅ triggered
linux-docs	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/docs`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-vulkan-bionic-py3.7-clang9	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`, `ciflow/vulkan`	✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test	`ciflow/all`, `ciflow/bazel`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-xenial-py3-clang5-mobile-build	`ciflow/all`, `ciflow/default`, `ciflow/linux`, `ciflow/mobile`, `ciflow/trunk`	✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static	`ciflow/all`, `ciflow/default`, `ciflow/linux`, `ciflow/mobile`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-clang7-asan	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/sanitizers`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-clang7-onnx	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/onnx`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-gcc5.4	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-gcc7	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
linux-xenial-py3.7-gcc7-no-ops	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
macos-arm64-binary-conda	`ciflow/binaries`, `ciflow/binaries_conda`, `ciflow/default`	✅ triggered
macos-arm64-binary-wheel	`ciflow/binaries`, `ciflow/binaries_wheel`, `ciflow/default`	✅ triggered
macos-binary-conda	`ciflow/binaries`, `ciflow/binaries_conda`, `ciflow/default`	✅ triggered
macos-binary-libtorch-cxx11-abi	`ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`	✅ triggered
macos-binary-libtorch-pre-cxx11	`ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`	✅ triggered
macos-binary-wheel	`ciflow/binaries`, `ciflow/binaries_wheel`, `ciflow/default`	✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single	`ciflow/all`, `ciflow/android`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit	`ciflow/all`, `ciflow/android`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/trunk`	✅ triggered
win-vs2019-cpu-py3	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/trunk`, `ciflow/win`	✅ triggered
win-vs2019-cuda11.3-py3	`ciflow/all`, `ciflow/cuda`, `ciflow/default`, `ciflow/trunk`, `ciflow/win`	✅ triggered
windows-binary-libtorch-cxx11-abi	`ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`	✅ triggered
windows-binary-libtorch-pre-cxx11	`ciflow/binaries`, `ciflow/binaries_libtorch`, `ciflow/default`	✅ triggered
windows-binary-wheel	`ciflow/binaries`, `ciflow/binaries_wheel`, `ciflow/default`	✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
docker-builds	`ciflow/all`, `ciflow/trunk`	🚫 skipped
ios-12-5-1-arm64	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/scheduled`	🚫 skipped
ios-12-5-1-arm64-coreml	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/scheduled`	🚫 skipped
ios-12-5-1-arm64-custom-ops	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/scheduled`	🚫 skipped
ios-12-5-1-arm64-metal	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/scheduled`	🚫 skipped
ios-12-5-1-x86-64	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
ios-12-5-1-x86-64-coreml	`ciflow/all`, `ciflow/ios`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/slow`, `ciflow/trunk`	🚫 skipped
linux-docs-push	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
macos-10-15-py3-arm64	`ciflow/all`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64	`ciflow/all`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
macos-11-py3-x86-64	`ciflow/all`, `ciflow/macos`, `ciflow/trunk`	🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/scheduled`, `ciflow/slow`, `ciflow/slow-gradcheck`	🚫 skipped
periodic-linux-xenial-cuda11.3-py3.7-gcc7-debug	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
periodic-win-vs2019-cuda11.5-py3	`ciflow/all`, `ciflow/cuda`, `ciflow/scheduled`, `ciflow/win`	🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build	`ciflow/all`, `ciflow/android`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk`	🚫 skipped
pytorch-xla-linux-bionic-py3.7-clang8	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`, `ciflow/trunk`, `ciflow/xla`	🚫 skipped

facebook-github-bot · 2022-03-06T23:22:28Z

🔗 Helpful links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/73842
📄 Preview docs built from this PR
📄 Preview C++ docs built from this PR
🔧 Opt-in to CIFlow to control what jobs run on your PRs

💊 CI failures summary and remediations

As of commit ee785f2 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

[ghstack-poisoned]

ghstack-source-id: c440ba0 Pull Request resolved: #73842

awgu · 2022-03-07T02:49:15Z

@awgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

**Overview** This cleans up the `ZeroRedundancyOptimizer` tests. I apologize for strong formatting changes mixed in with actually-beneficial changes. It was convenient to unify the formatting while doing a deep comb through the full test file. The main non-formatting changes include: - Using `@parametrize` instead of manually including `for` loops over possible argument values - Removing the `DEVICE` global variable, which was used only for the `TestZeroRedundancyOptimizerSingleRank` tests, in favor of consistent usage of `self.device` - Moving `assert ... == ...` to `self.assertEqual(..., ...)` when the assert is part of the test's correctness - Removing the `if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2):` conditional guards in favor of `@common_distributed.skip_if_no_gpu` - For `TestZeroRedundancyOptimizerDistributed`, `self.device` is `torch.device(self.rank)` if CUDA is available, which means that for world size of 4, `self.device` will error if `torch.cuda.device_count()` is 2 or 3. In other words, the existing conditional guard actually leaves a gap, only the existing test environments never hit that gap. It is more robust to use `skip_if_no_gpu` which requires the same number of GPUs as the world size so that `torch.device(self.rank)` can be used safely. - Renaming `test_multiple_groups()` to `test_nondefault_process_group()` - The existing `test_multiple_groups()` was slightly misnamed. Also, it is only nontrivial for a world size of (at least) 4 since it tests using a process group including only even ranks. It was marked as flaky on Windows, and I believe this is because of the world size and `torch.cuda.device_count()` mismatch. Now, the test only uses GPU if there are enough available and falls back to CPU otherwise, which is safe since the test uses Gloo backend. - There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target fitting into the broadcast bucket and not fitting into the broadcast bucket: https://github.com/pytorch/pytorch/blob/1d497114e7e13822df403082975d4b212d836207/test/distributed/optim/test_zero_redundancy_optimizer.py#L658-L684 - Changing `_test_zero_model_parallel()` to not use CPU - This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU. **Questions** - How might we limit the runs for `test_ddp_zero_overlap()`? Because it parameterizes over many values, it contributes significantly to the time-to-signal. However, it is an experimental feature, so it is not critical that the tests run every time. Differential Revision: [D34675709](https://our.internmc.facebook.com/intern/diff/D34675709) [ghstack-poisoned]

awgu · 2022-03-07T13:16:24Z

@awgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

**Overview** This cleans up the `ZeroRedundancyOptimizer` tests. I apologize for strong formatting changes mixed in with actually-beneficial changes. It was convenient to unify the formatting while doing a deep comb through the full test file. The main non-formatting changes include: - Using `@parametrize` instead of manually including `for` loops over possible argument values - Removing the `DEVICE` global variable, which was used only for the `TestZeroRedundancyOptimizerSingleRank` tests, in favor of consistent usage of `self.device` in both `TestZeroRedundancyOptimizerSingleRank` and `TestZeroRedundancyOptimizerDistributed` - Moving `assert ... == ...` to `self.assertEqual(..., ...)` when the assert is part of the test's correctness - Removing the `if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2):` conditional guards in favor of `@common_distributed.skip_if_no_gpu` - For `TestZeroRedundancyOptimizerDistributed`, `self.device` is `torch.device(self.rank)` if CUDA is available, while `self.world_size` is at least 2, even if `torch.cuda.device_count() == 1`. - The problematic case is exactly when `torch.cuda.device_count() == 1` but `self.world_size == 2` since then calling `self.device` on rank 1 will error. The existing conditional guard prevented this case for some tests, but it was not used consistently (e.g. `test_multiple_groups()`), which is most likely the reason for the hangs and resulting test flakiness. (From my experience landing the recent ZeRO constructor changes, the Windows environment uses a world size of 2 but only has 1 device available.) - A more robust solution is to always use the `skip_if_no_gpu` decorator as long as the test uses `self.device` and CUDA is available. This is in line with the recommended SPSD usage of ZeRO. - Renaming `test_multiple_groups()` to `test_nondefault_process_group()` - The existing `test_multiple_groups()` was slightly misnamed. Also, it is only nontrivial for a world size of (at least) 4 since it tests using a process group including only even ranks. It was marked as flaky on Windows, and I believe this is because of the world size and `torch.cuda.device_count()` mismatch. Now, the test only uses GPU if there are enough available and falls back to CPU otherwise, which is safe since the test uses Gloo backend. - There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target fitting into the broadcast bucket and not fitting into the broadcast bucket: https://github.com/pytorch/pytorch/blob/1d497114e7e13822df403082975d4b212d836207/test/distributed/optim/test_zero_redundancy_optimizer.py#L658-L684 - Changing `_test_zero_model_parallel()` to not use CPU - This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU. **Questions** - How might we limit the runs for `test_ddp_zero_overlap()`? Because it parameterizes over many values, it contributes significantly to the time-to-signal. However, it is an experimental feature, so it is not critical that the tests run every time. Differential Revision: [D34675709](https://our.internmc.facebook.com/intern/diff/D34675709) [ghstack-poisoned]

ghstack-source-id: bfbac1e Pull Request resolved: #73842

awgu · 2022-03-07T16:36:23Z

@awgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

awgu · 2022-03-07T20:56:34Z

Some internal tests are failing due to the test name parser (interacting with @parametrize). ~~I should have the fix ready locally, but I will wait to push to avoid re-triggering the CI tests.~~ Update: I decided to update the diff with the fix.

The takeaway is that the Github CI is now passing, even for the previously broken tests on Windows (e.g. #66059).

**Overview** This cleans up the `ZeroRedundancyOptimizer` tests. I apologize for strong formatting changes mixed in with actually-beneficial changes. It was convenient to unify the formatting while doing a deep comb through the full test file. The main non-formatting changes include: - Using `@parametrize` instead of manually including `for` loops over possible argument values - Removing the `DEVICE` global variable, which was used only for the `TestZeroRedundancyOptimizerSingleRank` tests, in favor of consistent usage of `self.device` in both `TestZeroRedundancyOptimizerSingleRank` and `TestZeroRedundancyOptimizerDistributed` - Moving `assert ... == ...` to `self.assertEqual(..., ...)` when the assert is part of the test's correctness - Removing the `if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2):` conditional guards in favor of `@common_distributed.skip_if_no_gpu` for `TestZeroRedundancyOptimizerDistributed` - For `TestZeroRedundancyOptimizerDistributed`, `self.device` is `torch.device(self.rank)` if CUDA is available, while `self.world_size` is at least 2, even if `torch.cuda.device_count() == 1`. - The problematic case is exactly when `torch.cuda.device_count() == 1` but `self.world_size == 2` since then calling `self.device` on rank 1 will error. The existing conditional guard prevented this case for some tests, but it was not used consistently (e.g. `test_multiple_groups()`), which is most likely the reason for the hangs and resulting test flakiness. (From my experience landing the recent ZeRO constructor changes, the Windows environment uses a world size of 2 but only has 1 device available.) - A more robust solution is to always use the `skip_if_no_gpu` decorator as long as the test uses `self.device` and CUDA is available. This is in line with the recommended SPSD usage of ZeRO. - Renaming `test_multiple_groups()` to `test_nondefault_process_group()` - The existing `test_multiple_groups()` was slightly misnamed. Also, it is only nontrivial for a world size of (at least) 4 since it tests using a process group including only even ranks. It was marked as flaky on Windows, and I believe this is because of the world size and `torch.cuda.device_count()` mismatch. Now, the test only uses GPU if there are enough available and falls back to CPU otherwise, which is safe since the test uses Gloo backend. - There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target fitting into the broadcast bucket and not fitting into the broadcast bucket: https://github.com/pytorch/pytorch/blob/1d497114e7e13822df403082975d4b212d836207/test/distributed/optim/test_zero_redundancy_optimizer.py#L658-L684 - Changing `_test_zero_model_parallel()` to not use CPU - This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU. **Questions** - How might we limit the runs for `test_ddp_zero_overlap()`? Because it parameterizes over many values, it contributes significantly to the time-to-signal. However, it is an experimental feature, so it is not critical that the tests run every time. Differential Revision: [D34675709](https://our.internmc.facebook.com/intern/diff/D34675709) [ghstack-poisoned]

ghstack-source-id: 50c4bc4 Pull Request resolved: #73842

awgu · 2022-03-08T00:52:01Z

@awgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

rohan-varma

LGTM, thanks!

rohan-varma · 2022-03-08T05:24:10Z

test/distributed/optim/test_zero_redundancy_optimizer.py

+        "use_gpu",
+        [True],
+        # Add `False` once the Gloo sync issue causing hangs is fixed
+        # See: https://github.com/pytorch/pytorch/issues/62300


wondering if we have plans to fix this issue in gloo?

cc @awgu @mrshenli @cbalioglu

Assigned #62300 to myself. I plan to work on it as part of the overhaul of our process group APIs.

Summary: Pull Request resolved: #73842 **Overview** This cleans up the `ZeroRedundancyOptimizer` tests. I apologize for strong formatting changes mixed in with actually-beneficial changes. It was convenient to unify the formatting while doing a deep comb through the full test file. The main non-formatting changes include: - Using `parametrize` instead of manually including `for` loops over possible argument values - Removing the `DEVICE` global variable, which was used only for the `TestZeroRedundancyOptimizerSingleRank` tests, in favor of consistent usage of `self.device` in both `TestZeroRedundancyOptimizerSingleRank` and `TestZeroRedundancyOptimizerDistributed` - Moving `assert ... == ...` to `self.assertEqual(..., ...)` when the assert is part of the test's correctness - Removing the `if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2):` conditional guards in favor of `common_distributed.skip_if_no_gpu` for `TestZeroRedundancyOptimizerDistributed` - For `TestZeroRedundancyOptimizerDistributed`, `self.device` is `torch.device(self.rank)` if CUDA is available, while `self.world_size` is at least 2, even if `torch.cuda.device_count() == 1`. - The problematic case is exactly when `torch.cuda.device_count() == 1` but `self.world_size == 2` since then calling `self.device` on rank 1 will error. The existing conditional guard prevented this case for some tests, but it was not used consistently (e.g. `test_multiple_groups()`), which is most likely the reason for the hangs and resulting test flakiness. (From my experience landing the recent ZeRO constructor changes, the Windows environment uses a world size of 2 but only has 1 device available.) - A more robust solution is to always use the `skip_if_no_gpu` decorator as long as the test uses `self.device` and CUDA is available. This is in line with the recommended SPSD usage of ZeRO. - Renaming `test_multiple_groups()` to `test_nondefault_process_group()` - The existing `test_multiple_groups()` was slightly misnamed. Also, it is only nontrivial for a world size of (at least) 4 since it tests using a process group including only even ranks. It was marked as flaky on Windows, and I believe this is because of the world size and `torch.cuda.device_count()` mismatch. Now, the test only uses GPU if there are enough available and falls back to CPU otherwise, which is safe since the test uses Gloo backend. - There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target fitting into the broadcast bucket and not fitting into the broadcast bucket: https://github.com/pytorch/pytorch/blob/1d497114e7e13822df403082975d4b212d836207/test/distributed/optim/test_zero_redundancy_optimizer.py#L658-L684 - Changing `_test_zero_model_parallel()` to not use CPU - This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU. **Questions** - How might we limit the runs for `test_ddp_zero_overlap()`? Because it parameterizes over many values, it contributes significantly to the time-to-signal. However, it is an experimental feature, so it is not critical that the tests run every time. Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D34675709 Pulled By: awgu fbshipit-source-id: 71ce9ac968fb34415cd65206855b4bb5e67754fb

github-actions · 2022-03-08T13:15:54Z

Hey @awgu.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

Summary: Pull Request resolved: pytorch/pytorch#73842 **Overview** This cleans up the `ZeroRedundancyOptimizer` tests. I apologize for strong formatting changes mixed in with actually-beneficial changes. It was convenient to unify the formatting while doing a deep comb through the full test file. The main non-formatting changes include: - Using `parametrize` instead of manually including `for` loops over possible argument values - Removing the `DEVICE` global variable, which was used only for the `TestZeroRedundancyOptimizerSingleRank` tests, in favor of consistent usage of `self.device` in both `TestZeroRedundancyOptimizerSingleRank` and `TestZeroRedundancyOptimizerDistributed` - Moving `assert ... == ...` to `self.assertEqual(..., ...)` when the assert is part of the test's correctness - Removing the `if self.rank >= self.world_size or (torch.cuda.is_available() and torch.cuda.device_count() < 2):` conditional guards in favor of `common_distributed.skip_if_no_gpu` for `TestZeroRedundancyOptimizerDistributed` - For `TestZeroRedundancyOptimizerDistributed`, `self.device` is `torch.device(self.rank)` if CUDA is available, while `self.world_size` is at least 2, even if `torch.cuda.device_count() == 1`. - The problematic case is exactly when `torch.cuda.device_count() == 1` but `self.world_size == 2` since then calling `self.device` on rank 1 will error. The existing conditional guard prevented this case for some tests, but it was not used consistently (e.g. `test_multiple_groups()`), which is most likely the reason for the hangs and resulting test flakiness. (From my experience landing the recent ZeRO constructor changes, the Windows environment uses a world size of 2 but only has 1 device available.) - A more robust solution is to always use the `skip_if_no_gpu` decorator as long as the test uses `self.device` and CUDA is available. This is in line with the recommended SPSD usage of ZeRO. - Renaming `test_multiple_groups()` to `test_nondefault_process_group()` - The existing `test_multiple_groups()` was slightly misnamed. Also, it is only nontrivial for a world size of (at least) 4 since it tests using a process group including only even ranks. It was marked as flaky on Windows, and I believe this is because of the world size and `torch.cuda.device_count()` mismatch. Now, the test only uses GPU if there are enough available and falls back to CPU otherwise, which is safe since the test uses Gloo backend. - There was also a duplicated section, which I was unsure how to non-naively de-duplicate. The top half and bottom half are identical even though they claim to target fitting into the broadcast bucket and not fitting into the broadcast bucket: https://github.com/pytorch/pytorch/blob/1d497114e7e13822df403082975d4b212d836207/test/distributed/optim/test_zero_redundancy_optimizer.py#L658-L684 - Changing `_test_zero_model_parallel()` to not use CPU - This is my own fault, having introduced this inefficiency last summer. It makes more sense to simply designate one of the two GPUs for a process to be its default device rather than routing through CPU. **Questions** - How might we limit the runs for `test_ddp_zero_overlap()`? Because it parameterizes over many values, it contributes significantly to the time-to-signal. However, it is an experimental feature, so it is not critical that the tests run every time. Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D34675709 Pulled By: awgu fbshipit-source-id: 71ce9ac968fb34415cd65206855b4bb5e67754fb (cherry picked from commit 34e3dd0a184318ea9f63a1ee20cd14b111af3501)

[ZeRO][BE] Clean up ZeRO tests

a536887

[ghstack-poisoned]

pytorch-bot bot added the ciflow/default label Mar 6, 2022

facebook-github-bot added the cla signed label Mar 6, 2022

facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Mar 6, 2022

Update on "[ZeRO][BE] Clean up ZeRO tests"

2a9d576

[ghstack-poisoned]

awgu pushed a commit that referenced this pull request Mar 6, 2022

[ZeRO][BE] Clean up ZeRO tests

e28efb8

ghstack-source-id: c440ba0 Pull Request resolved: #73842

awgu added ci/windows labels Mar 7, 2022

awgu pushed a commit that referenced this pull request Mar 7, 2022

[ZeRO][BE] Clean up ZeRO tests

7b93de5

ghstack-source-id: bfbac1e Pull Request resolved: #73842

awgu marked this pull request as ready for review March 7, 2022 20:43

awgu requested review from H-Huang, mingzhe09088, mrshenli, pritamdamania87, rohan-varma and zhaojuanmao as code owners March 7, 2022 20:44

awgu pushed a commit that referenced this pull request Mar 8, 2022

[ZeRO][BE] Clean up ZeRO tests

585de7c

ghstack-source-id: 50c4bc4 Pull Request resolved: #73842

rohan-varma approved these changes Mar 8, 2022

View reviewed changes

rohan-varma reviewed Mar 8, 2022

View reviewed changes

pytorchmergebot closed this in 9012e8d Mar 8, 2022

facebook-github-bot deleted the gh/awgu/13/head branch March 11, 2022 15:17

awgu mentioned this pull request Mar 16, 2022

TestZeroRedundancyOptimizerDistributed.test_multiple_groups is failing intermittently #66059

Closed

WBobby mentioned this pull request Aug 17, 2022

Add ROCm5.2.3/AMDGPU support for PyTorch WBobby/pytorch#2

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[ZeRO][BE] Clean up ZeRO tests #73842

[ZeRO][BE] Clean up ZeRO tests #73842

Uh oh!

awgu commented Mar 6, 2022 •

edited

Loading

Uh oh!

pytorch-bot bot commented Mar 6, 2022

⚛️ CI Flow

Uh oh!

facebook-github-bot commented Mar 6, 2022 •

edited

Loading

Uh oh!

awgu commented Mar 7, 2022

Uh oh!

awgu commented Mar 7, 2022

Uh oh!

awgu commented Mar 7, 2022

Uh oh!

awgu commented Mar 7, 2022 •

edited

Loading

Uh oh!

awgu commented Mar 8, 2022

Uh oh!

rohan-varma left a comment

Uh oh!

rohan-varma Mar 8, 2022

Uh oh!

cbalioglu Mar 8, 2022

Uh oh!

github-actions bot commented Mar 8, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

	# Model fitting in the broadcast bucket
	model = torch.nn.Sequential(
	torch.nn.Linear(input_width, hidden),
	torch.nn.Linear(hidden, target_width),
	).to(self.device)

	# With SGD, Momentum is required to get a state to shard
	optimizer = ZeroRedundancyOptimizer(
	model.parameters(), optimizer_class=SGD, lr=0.1, momentum=0.99, process_group=process_group
	)
	check(optimizer)

	# Model not-fitting in the broadcast bucket
	model = torch.nn.Sequential(
	torch.nn.Linear(input_width, hidden),
	torch.nn.Linear(hidden, target_width),
	).to(self.device)

	# With SGD, Momentum is required to get a state to shard
	optimizer = ZeroRedundancyOptimizer(
	model.parameters(),
	optimizer_class=SGD,
	lr=0.1,
	momentum=0.99,
	process_group=process_group,
	)
	check(optimizer)

[ZeRO][BE] Clean up ZeRO tests #73842

[ZeRO][BE] Clean up ZeRO tests #73842

Uh oh!

Conversation

awgu commented Mar 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot bot commented Mar 6, 2022

⚛️ CI Flow

Uh oh!

facebook-github-bot commented Mar 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful links

💊 CI failures summary and remediations

Uh oh!

awgu commented Mar 7, 2022

Uh oh!

awgu commented Mar 7, 2022

Uh oh!

awgu commented Mar 7, 2022

Uh oh!

awgu commented Mar 7, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

awgu commented Mar 8, 2022

Uh oh!

rohan-varma left a comment

Choose a reason for hiding this comment

Uh oh!

rohan-varma Mar 8, 2022

Choose a reason for hiding this comment

Uh oh!

cbalioglu Mar 8, 2022

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Mar 8, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

awgu commented Mar 6, 2022 •

edited

Loading

facebook-github-bot commented Mar 6, 2022 •

edited

Loading

awgu commented Mar 7, 2022 •

edited

Loading