[FSDP] Add grad accumulation without no_sync()
#73535
Conversation
💊 CI failures summary and remediations: As of commit ab03d0e (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚
**Overview** This adds FSDP gradient accumulation without `no_sync()`, which comparatively has more network bandwidth demand but less GPU memory requirement per worker. **Test Plan** I augmented the tests to test gradient accumulation without `no_sync()` and also interleaving iterations accumulating with and without `no_sync()`.
**Overview** This adds FSDP gradient accumulation without `no_sync()`, which comparatively has more network bandwidth demand but less GPU memory requirement per worker. This also adds `p_assert()` (taken from Fairscale), which prints the assert error message before raising the `AssertionError`. It is meant to be used when running in the autograd backward context since otherwise the error message is swallowed, giving an unhelpful error like: ``` <built-in method run_backward of torch._C._EngineBase object at 0x7f1fd518dc80> returned NULL without setting an error ``` **Test Plan** I augmented the tests to test gradient accumulation without `no_sync()` and also interleaving iterations accumulating with and without `no_sync()`.
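For illustration, a minimal sketch of the `p_assert()` idea described above. This is not the exact helper added in the PR; the point is simply that printing the message before raising keeps it visible even when autograd's C++ engine swallows the Python exception text.

```python
# Minimal sketch of a p_assert()-style helper (illustrative only; the actual
# helper added in this PR may differ). Printing the message before raising
# keeps it visible even when autograd's C++ engine swallows the exception text.
def p_assert(cond: bool, msg: str) -> None:
    if not cond:
        print(msg)
        raise AssertionError(msg)
```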
@awgu has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
rohan-varma
left a comment
Thanks for turning this around so quickly! A couple of small comments, will stamp after those are addressed!
```python
    )
    param._saved_grad_shard.data += output.data  # type: ignore[attr-defined]
else:
    param._saved_grad_shard = output.data  # type: ignore[attr-defined]
```
Are we switching from using `output` to `output.data` in the non-grad-accumulation use case?
Good question. This is good for me to clarify. Was there any reason to use `output` before?
In my understanding, it does not matter since we are not supporting taking the gradient of the gradient, so we do not need to record any operations on `output` in the autograd graph. Fairscale uses `output.data` in both code paths, though I do not see why either way is better for the non-gradient-accumulation case.
```python
    [2, 4],
    "configs",
    [
        [_GradAccConfig(True, 4)],
```
nit: if you want, maybe pass in named args here so future developers know what each value means
```python
        [_GradAccConfig(True, 4)],
        [_GradAccConfig(False, 4)],
        [_GradAccConfig(True, 2), _GradAccConfig(False, 2), _GradAccConfig(True, 2)],
        [_GradAccConfig(False, 2), _GradAccConfig(True, 2), _GradAccConfig(False, 2)],
```
Wondering why we have duplicated configs?
I wanted to test interleaving both ways:
- with `no_sync()` -> without `no_sync()` -> with `no_sync()`
- without `no_sync()` -> with `no_sync()` -> without `no_sync()`
The reason I wanted two separate tests is that it could matter what was the last accumulation mode right before gradient synchronization.
Good call! Let's keep it then and add a small comment so no one removes it in the future.
```python
    [2, 4],
    "configs",
    [
        [_GradAccConfig(True, 4)],
```
to reduce # of tests, can we just have:
- `use_context = true, interval={2, 4}`
- `use_context = false, interval={2, 4}`
Given your comments last time, I thought about how to keep the set of tests minimal. I feel like these 4 configs test distinct things:
1. Gradient accumulation with `no_sync()`
2. Gradient accumulation without `no_sync()`
3. Gradient accumulation interleaving iterations with and without `no_sync()`, where the last iteration before synchronizing gradients is inside `no_sync()`
4. Gradient accumulation interleaving iterations with and without `no_sync()`, where the last iteration before synchronizing gradients is outside `no_sync()`

My concern is that I do not want to overfit the tests to the current working implementation. It is not obvious to me that 1) and 2) imply 3) and 4), and if we only had 3) and 4) and they break, then we would probably end up rewriting 1) and 2).
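For concreteness, here is a hedged sketch of the four configs above written with named arguments (per the earlier nit). The field names `use_no_sync` and `num_iters` are assumptions about `_GradAccConfig`, not necessarily its actual definition, and the dataclass below is only included to keep the sketch self-contained.

```python
from dataclasses import dataclass


@dataclass
class _GradAccConfig:  # assumed shape of the test helper, for illustration only
    use_no_sync: bool
    num_iters: int


CONFIGS = [
    # 1) accumulate with no_sync() only
    [_GradAccConfig(use_no_sync=True, num_iters=4)],
    # 2) accumulate without no_sync() only
    [_GradAccConfig(use_no_sync=False, num_iters=4)],
    # 3) interleaved; last phase before syncing gradients is inside no_sync()
    [_GradAccConfig(use_no_sync=True, num_iters=2),
     _GradAccConfig(use_no_sync=False, num_iters=2),
     _GradAccConfig(use_no_sync=True, num_iters=2)],
    # 4) interleaved; last phase before syncing gradients is outside no_sync()
    [_GradAccConfig(use_no_sync=False, num_iters=2),
     _GradAccConfig(use_no_sync=True, num_iters=2),
     _GradAccConfig(use_no_sync=False, num_iters=2)],
]
```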
Sounds good! Considering you've gone through the tradeoffs, this makes sense to leave as is.
```python
# Average grad by world_size for consistency with PyTorch DDP.
output.div_(self.gradient_postdivide_factor)
param.grad.data = output
accumulate_grad = getattr(param, "_saved_grad_shard", None) is not None
```
Might be useful to add a small comment about how gradient accumulation without `no_sync()` is implemented. From my understanding:
- During backward, we point an attribute `_saved_grad_shard` to the gradient shard.
- If we are accumulating gradients, we accumulate onto `_saved_grad_shard`.
- When finalizing the backward pass before running the optimizer, we point `p.grad` to the saved grad shard so that the optimizer works on the right accumulated gradient.
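A minimal sketch of that flow follows. This is illustrative pseudocode, not the actual FSDP hooks; `param`, `reduced_grad_shard`, and the function names here are placeholders.

```python
# Illustrative sketch of the accumulation flow described above (not the actual
# FSDP code; param, reduced_grad_shard, and these function names are placeholders).

def _post_backward(param, reduced_grad_shard):
    # `reduced_grad_shard` is this rank's shard after the reduce-scatter.
    if getattr(param, "_saved_grad_shard", None) is not None:
        # Accumulating without no_sync(): add onto the shard saved earlier.
        param._saved_grad_shard += reduced_grad_shard
    else:
        # First accumulation step: just save the reduced shard.
        param._saved_grad_shard = reduced_grad_shard


def _finalize_backward(param):
    # Before the optimizer step, expose the accumulated shard as .grad so the
    # optimizer works on the accumulated gradient; the saved shard is picked
    # up again (or cleared) at the start of the next backward.
    param.grad = param._saved_grad_shard
```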
```python
can_accumulate_grad = p.grad.device == p.data.device and \
    p.grad.size() == p._local_shard.shape  # type: ignore[attr-defined]
if can_accumulate_grad:
    p._saved_grad_shard = p.grad.data  # type: ignore[attr-defined]
```
Add a comment here noting that this is what makes the gradient accumulation work?
```python
    p.grad.size() != p._orig_size  # type: ignore[attr-defined]
    or p.grad.device != p.device
):
    can_accumulate_grad = p.grad.device == p.data.device and \
```
Can we have a unittest where `can_accumulate_grad = False` and the user tries to accumulate grads, and we raise an appropriate error?
After some investigation, I think that having a non-silent error for the case of gradient accumulation outside no_sync() while using CPU offloading requires some non-trivial re-design. It may be easiest to leave this error as silent for now and work on adding the compatibility itself.
(Some clarifying questions and comments)
Suppose we are using CPU offloading.
- Outside `no_sync()`, if we want to accumulate gradients, are we performing the addition between the existing gradient and the newly-reduced gradient on CPU or on GPU? If on CPU, then we should perform a device-to-host transfer of the reduced gradient. If on GPU, then we should perform a host-to-device transfer of the existing gradient and a device-to-host transfer to re-offload the result.
- Inside `no_sync()`, the existing implementation does not offload any gradients to CPU. Rather, the gradients are held in GPU memory until the first iteration outside `no_sync()`, which performs the gradient synchronization. At the end of that iteration's backward pass, the synchronized gradient shard is offloaded to CPU. Should we include any warning or message at runtime to explain this behavior to the user? I added a note about it to the `no_sync()` docstring.
The challenge behind a non-silent error is that the `no_sync()` + CPU offloading case conflicts with the non-`no_sync()` + CPU offloading case.

- The crux is that accumulating gradients using `no_sync()` follows the pattern of: accumulate for `N-1` iterations inside `no_sync()` and execute `1` normal iteration outside `no_sync()`.
- That final normal iteration is indistinguishable from a non-`no_sync()` iteration unless we track something like a `bool` flag indicating that the last iteration was inside `no_sync()`.
  - I am reluctant to add such a flag since it is solely a patch for one case and may be hiding the underlying design problem, but I am open to your thoughts.
- For the final iteration coming out of `no_sync()`, the gradient is still on GPU, so performing the accumulation computation `param._saved_grad_shard.data += output.data` has no issue.
- For a non-`no_sync()` iteration (after the first), the gradient was previously offloaded to CPU, so performing the accumulation computation `param._saved_grad_shard.data += output.data` has conflicting devices.
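For concreteness, a small standalone illustration of the device conflict in the last bullet. It assumes a CUDA device is available; the tensor shapes are arbitrary and do not correspond to any particular model.

```python
# Standalone illustration of the device mismatch described above
# (assumes a CUDA device is available; shapes are arbitrary).
import torch

saved_shard = torch.zeros(8)                   # gradient shard offloaded to CPU
reduced_shard = torch.ones(8, device="cuda")   # newly reduce-scattered shard on GPU

try:
    saved_shard += reduced_shard               # conflicting devices
except RuntimeError as e:
    print(f"accumulation across devices fails: {e}")
```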
When we are using gradient accumulation outside of no_sync + CPU offload, don't we already raise an appropriate error on L1336 of this PR? And is it possible to add a unittest for this?
That is an internal assert. Given the current implementation, that assert will never get triggered. I added it because the logic is quite complicated, and I wanted to demonstrate that if we are in the non-no_sync() + CPU offloading case, then we should never be in that branch.
(or more directly, the contrapositive: if we are in that branch, we should not be in the non-no_sync() + CPU offloading case, meaning that in particular we must not be CPU offloading)
**Overview** This adds FSDP gradient accumulation without `no_sync()`, which comparatively has more network bandwidth demand but less GPU memory requirement per worker. This also adds `p_assert()` (taken from Fairscale), which prints the assert error message before raising the `AssertionError`. It is meant to be used when running in the autograd backward context since otherwise the error message is swallowed, giving an unhelpful error like: ``` <built-in method run_backward of torch._C._EngineBase object at 0x7f1fd518dc80> returned NULL without setting an error ``` **Test Plan** I augmented the tests to test gradient accumulation without `no_sync()` and also interleaving iterations accumulating with and without `no_sync()`. Differential Revision: [D34533546](https://our.internmc.facebook.com/intern/diff/D34533546)
**Overview** - This adds FSDP gradient accumulation without `no_sync()`, which comparatively has more network bandwidth demand but less GPU memory requirement per worker. - This fixes a bug in the `no_sync()` testing, where the CPU offloading and backward prefetch arguments were not propagating to the `FullyShardedDataParallel` constructor. - This adds `p_assert()` (taken from Fairscale), which prints the assert error message before raising the `AssertionError`. It is meant to be used when running in the autograd backward context since otherwise the error message is swallowed, giving an unhelpful error like: ``` <built-in method run_backward of torch._C._EngineBase object at 0x7f1fd518dc80> returned NULL without setting an error ``` NOTE: Gradient accumulation without `no_sync()` is not currently compatible with CPU offloading. **Test Plan** I augmented the tests to test gradient accumulation without `no_sync()` and also interleaving iterations accumulating with and without `no_sync()`. Differential Revision: [D34533546](https://our.internmc.facebook.com/intern/diff/D34533546)
```python
# try to accumulate gradients. FSDP accumulates gradients in
# the separate variable `p._saved_grad_shard` to leave `p.grad`
# for the per-iteration gradient.
if prev_iter_outside_no_sync:
```
Since this PR is getting a bit cluttered, I wanted to specifically point this part out. Previously, I presented the logic here incorrectly, but hopefully this should make sense now.
I think the precise condition for when to use `p._saved_grad_shard` is whether the previous iteration was outside `no_sync()`.
Suppose we have:

(1) some iterations outside `no_sync()` ->
(2) some iterations inside `no_sync()` ->
(3) one iteration outside `no_sync()`.

- In the pre-backward hook of (3), the FSDP instance holds an unsharded gradient in `p.grad`, which is the result of accumulating gradients from (2).
- It computes that iteration's gradient, which is accumulated with the existing `p.grad` from (2) via the autograd engine and still stored in `p.grad`.
- In the post-backward hook, it reduce-scatters that accumulated gradient stored in `p.grad`.
- After the reduce-scatter, it accumulates the accumulated gradient from (2) and (3) with the accumulated gradient from (1) saved in `_saved_grad_shard`. This "super-accumulated" gradient is stored in `_saved_grad_shard`.
  - This step shows why, as long as the previous iteration was outside `no_sync()`, there may be a gradient to accumulate on the first future iteration also outside `no_sync()`.
I had misunderstood the comments from Fairscale (see here). I did not realize that the conditioning on inside/outside no_sync() referred to the previous iteration.
| f"existing grad shape={param._saved_grad_shard.shape} " | ||
| f"new grad shape={output.shape}" # type: ignore[attr-defined] | ||
| ) | ||
| p_assert( |
One more thing to point out: I previously had an assert like `not self.cpu_offload.offload_params` just as an internal assert to make sure that CPU offloading never takes this code path.
However, I changed it to a more direct assert here in case we distinguish between offloading parameters and offloading gradients in the future before we solve gradient accumulation with CPU offloading.
zhaojuanmao
left a comment
solid tests! also left some minor comments
| f"existing grad device={param._saved_grad_shard.device} " | ||
| f"new grad device={output.device}" # type: ignore[attr-defined] | ||
| ) | ||
| param._saved_grad_shard.data += output.data # type: ignore[attr-defined] |
Will it work if we remove the `.data`?
Yup, it still works. I will remove the .data for both in this line.
```python
# FSDP currently does not support gradient accumulation
# outside `no_sync()` when using CPU offloading. Trying to
# do so yields incorrect results since FSDP will use the
# newly-reduced gradient instead of accumulating with any
# existing gradient.
```
Could we add a GitHub issue to support grad accumulation with CPU offloading?
Done: #73784
```python
# newly-reduced gradient instead of accumulating with any
# existing gradient.
if not offloaded:
    p._saved_grad_shard = p.grad.data  # type: ignore[attr-defined]
```
Could we add a warning for this case, with the warning message echoing the comment: "FSDP currently does not support gradient accumulation outside `no_sync()` when using CPU offloading. Trying to do so yields incorrect results since FSDP will use the newly-reduced gradient instead of accumulating with any existing gradient."?
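A hedged sketch of what such a warning could look like; the exact placement, condition, and wording are up to the author, and the `offloaded` argument here only mirrors the variable name in the diff above.

```python
import warnings


def _warn_grad_acc_with_cpu_offload(offloaded: bool) -> None:
    # Sketch of the suggested warning (placement and wording are assumptions).
    if offloaded:
        warnings.warn(
            "FSDP does not support gradient accumulation outside `no_sync()` "
            "when using CPU offloading: the newly-reduced gradient overwrites "
            "any existing gradient instead of accumulating with it."
        )
```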
**Overview** - This adds FSDP gradient accumulation without `no_sync()`, which comparatively has more network bandwidth demand but less GPU memory requirement per worker. - This fixes a bug in the `no_sync()` testing, where the CPU offloading and backward prefetch arguments were not propagating to the `FullyShardedDataParallel` constructor. - This adds `p_assert()` (taken from Fairscale), which prints the assert error message before raising the `AssertionError`. It is meant to be used when running in the autograd backward context since otherwise the error message is swallowed, giving an unhelpful error like: ``` <built-in method run_backward of torch._C._EngineBase object at 0x7f1fd518dc80> returned NULL without setting an error ``` NOTE: Gradient accumulation without `no_sync()` is not currently compatible with CPU offloading. **Test Plan** I augmented the tests to test gradient accumulation interleaving iterations accumulating with and without `no_sync()`. Differential Revision: [D34533546](https://our.internmc.facebook.com/intern/diff/D34533546)
Summary: Pull Request resolved: #73535 **Overview** - This adds FSDP gradient accumulation without `no_sync()`, which comparatively has more network bandwidth demand but less GPU memory requirement per worker. - This fixes a bug in the `no_sync()` testing, where the CPU offloading and backward prefetch arguments were not propagating to the `FullyShardedDataParallel` constructor. - This adds `p_assert()` (taken from Fairscale), which prints the assert error message before raising the `AssertionError`. It is meant to be used when running in the autograd backward context since otherwise the error message is swallowed, giving an unhelpful error like: ``` <built-in method run_backward of torch._C._EngineBase object at 0x7f1fd518dc80> returned NULL without setting an error ``` NOTE: Gradient accumulation without `no_sync()` is not currently compatible with CPU offloading. **Test Plan** I augmented the tests to test gradient accumulation interleaving iterations accumulating with and without `no_sync()`. After this diff: - QPS (ResNet): f328439897 - QPS (RoBERTa): f328440141 - Accuracy: f328442119 Before this diff (trunk): - QPS (ResNet): f328432756 - QPS (RoBERTa): f328436766 - Accuracy: f328437896 Test Plan: Imported from OSS Reviewed By: zhaojuanmao Differential Revision: D34533546 Pulled By: awgu fbshipit-source-id: 821d762dfad5f2b1e59adcb8e5cb7c277399040c
@awgu I was thinking maybe we should add some documentation somewhere around the gradient accumulation that is supported in FSDP: how to use it and what the tradeoffs are?
I think this is a great idea, especially since right now I do not have good intuition for the tradeoffs either.
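As a possible starting point for that documentation, here is a hedged usage sketch of the two accumulation styles. It assumes an already-constructed FSDP-wrapped `model`, an `optimizer`, an iterable of `batches`, and an accumulation window `num_accum`; the loss computation is elided to `model(batch).sum()` for brevity.

```python
# Hedged usage sketch of the two gradient accumulation styles with FSDP
# (assumes `model` is an already-wrapped FullyShardedDataParallel module,
# plus `optimizer`, `batches`, and `num_accum` defined by the user).

# Style 1: inside no_sync() -- skips gradient reduction on the first
# num_accum - 1 backwards, saving communication but keeping unsharded
# gradients in GPU memory until the synchronizing iteration.
for i, batch in enumerate(batches):
    if (i + 1) % num_accum != 0:
        with model.no_sync():
            model(batch).sum().backward()
    else:
        model(batch).sum().backward()  # reduce-scatter + accumulate the shard
        optimizer.step()
        optimizer.zero_grad()

# Style 2: without no_sync() -- every backward reduce-scatters, so it uses
# more network bandwidth but only keeps the sharded gradient per worker.
for i, batch in enumerate(batches):
    model(batch).sum().backward()
    if (i + 1) % num_accum == 0:
        optimizer.step()
        optimizer.zero_grad()
```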
Stack from ghstack:
- #73535 [FSDP] Add grad accumulation without `no_sync()`

**Overview**

- This adds FSDP gradient accumulation without `no_sync()`, which comparatively has more network bandwidth demand but less GPU memory requirement per worker.
- This fixes a bug in the `no_sync()` testing, where the CPU offloading and backward prefetch arguments were not propagating to the `FullyShardedDataParallel` constructor.
- This adds `p_assert()` (taken from Fairscale), which prints the assert error message before raising the `AssertionError`. It is meant to be used when running in the autograd backward context since otherwise the error message is swallowed, giving an unhelpful error like `<built-in method run_backward of torch._C._EngineBase object at 0x7f1fd518dc80> returned NULL without setting an error`.

NOTE: Gradient accumulation without `no_sync()` is not currently compatible with CPU offloading.

**Test Plan**

I augmented the tests to test gradient accumulation interleaving iterations accumulating with and without `no_sync()`.

Differential Revision: D34533546