[RFC] Separate CPU offload activation to its own wrapper #85459
Conversation
Passing `offload_to_cpu=True` to `checkpoint_wrapper` is a bit confusing, because it causes the activation checkpoint args to be ignored and we do CPU offloading instead. This isn't ideal from an API design perspective, so this PR proposes making `offload_wrapper` its own concept. Offload to CPU and checkpointing can then be composed together, such as:

```
# apply AC to transformer layers
apply_ac_wrapper(model, checkpoint_wrapper, check_fn=lambda mod: isinstance(mod, TransformerLayer))
# offload the rest of activations to CPU
model = offload_wrapper(model)
```

Will polish / add tests if this proposal sounds good.

Differential Revision: [D39719854](https://our.internmc.facebook.com/intern/diff/D39719854/)
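For concreteness, here is a minimal end-to-end sketch of the proposed composition. The helper names (`apply_activation_checkpointing`, `checkpoint_wrapper`, `offload_wrapper`) reflect the `torch.distributed.algorithms._checkpoint.checkpoint_wrapper` module as it exists today, and the toy `TransformerLayer` is invented for the example; treat this as an illustration rather than the PR's exact API.

```python
import torch
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,  # the PR text refers to this as apply_ac_wrapper
    checkpoint_wrapper,
    offload_wrapper,
)

class TransformerLayer(nn.Module):
    """Toy stand-in for a real transformer block."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = nn.Sequential(*[TransformerLayer() for _ in range(4)])

# Recompute (checkpoint) activations only for the transformer layers...
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=checkpoint_wrapper,
    check_fn=lambda mod: isinstance(mod, TransformerLayer),
)
# ...and offload the remaining activations of the whole model to CPU.
model = offload_wrapper(model)

loss = model(torch.randn(2, 128)).sum()
loss.backward()
```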
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/85459. Note: links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 59d1dc3. This comment was automatically generated by Dr. CI and updates every 15 minutes.
```python
super().__init__(mod)

def forward(self, *args, **kwargs):
    with save_on_cpu(pin_memory=True):
```
Tangentially related:
> You should not over-allocate pinned memory. Doing so can reduce overall system performance because it reduces the amount of physical memory available to the operating system and other programs. How much is too much is difficult to tell in advance, so as with all optimizations, test your applications and the systems they run on for optimal performance parameters.

https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
For activation offload, we operate under the assumption that we have unlimited memory to pin. I think it is worthwhile to investigate the robustness of this assumption.
Without any offloading, the total activation size is upper bounded by the GPU memory size, but with offloading, that can be increased significantly (e.g. <= the number of activations times the GPU memory size). For some workloads that activation offloading targets (e.g. large batch size and activations; small model), I imagine that the total activation size can be larger than the GPU memory size. This is exacerbated if CPU offloading is simultaneously enabled.
That being said, if the CPU memory size is sufficiently large and not used for much else during program execution, then pinning significant amounts of CPU memory may not be a problem. I think we should simply try to get some better intuition here.
This makes sense. I filed an issue: #86097. In the medium term, if we determine we shouldn't always pin memory, we could make this flag configurable; in the long term it could possibly be autotuned.
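As a rough illustration of the configurable-flag idea, here is a minimal sketch. The class name and the `pin_memory` constructor argument are hypothetical and not part of this PR; only `torch.autograd.graph.save_on_cpu` is existing API.

```python
import torch
from torch.autograd.graph import save_on_cpu

class ConfigurableOffloadWrapper(torch.nn.Module):
    """Hypothetical offload wrapper whose pinned-memory behavior is configurable."""

    def __init__(self, mod: torch.nn.Module, pin_memory: bool = True):
        super().__init__()
        self.mod = mod
        self.pin_memory = pin_memory

    def forward(self, *args, **kwargs):
        # save_on_cpu moves tensors saved for backward to CPU during forward
        # and copies them back to their original device when backward runs.
        with save_on_cpu(pin_memory=self.pin_memory):
            return self.mod(*args, **kwargs)
```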
awgu left a comment:
Left some nits.
I think renaming to `apply_activation_wrapper()` in the short term is still reasonable. That being said, I am curious:
- What is the cost to renaming it now? The previous PR renamed it, so it seems to be okay to rename.
- What are the relevant future plans/discussions that may affect the naming?
```python
)

def forward(self, *args, **kwargs):
    raise ValueError("Implement me")
```
nit:

```diff
- raise ValueError("Implement me")
+ raise NotImplementedError("Subclasses should implement this")
```
```python
class CheckpointWrapper(ActivationWrapper):
    """
    An nn.Module that wraps another nn.Module with checkpointing. Note that this
```
nit: Do we want these in code type?
```diff
- An nn.Module that wraps another nn.Module with checkpointing. Note that this
+ An ``nn.Module`` that wraps another ``nn.Module`` with checkpointing. Note that this
```
```python
    *checkpoint_fn_args,
    **checkpoint_fn_kwargs,
):
    super().__init__()
```
The lint error seems real?
```diff
- super().__init__()
+ super().__init__(mod)
```
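For context on the classes these comments touch, here is a minimal sketch of the wrapper hierarchy this PR introduces. The `_checkpoint_wrapped_module` attribute name, the non-reentrant `torch.utils.checkpoint.checkpoint` call, and the simplified constructors are assumptions for illustration; the actual diff carries more options (e.g. custom checkpoint functions and args).

```python
import torch.nn as nn
from torch.autograd.graph import save_on_cpu
from torch.utils.checkpoint import checkpoint

class ActivationWrapper(nn.Module):
    """Shared base: holds the wrapped module; subclasses decide how forward runs."""

    def __init__(self, mod: nn.Module):
        super().__init__()
        self._checkpoint_wrapped_module = mod

    def forward(self, *args, **kwargs):
        raise NotImplementedError("Subclasses should implement this")

class OffloadWrapper(ActivationWrapper):
    """Offloads activations saved for backward to CPU (what offload_wrapper returns)."""

    def forward(self, *args, **kwargs):
        with save_on_cpu(pin_memory=True):
            return self._checkpoint_wrapped_module(*args, **kwargs)

class CheckpointWrapper(ActivationWrapper):
    """Recomputes activations in backward instead of saving them."""

    def forward(self, *args, **kwargs):
        return checkpoint(
            self._checkpoint_wrapped_module, *args, use_reentrant=False, **kwargs
        )
```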
Discussed with @awgu: we'll punt the renaming to the future, when the API is more finalized and a functional approach is being taken.
awgu left a comment:
LGTM!
```python
    ckpt_wrapper = offload_wrapper
else:
    ckpt_wrapper = checkpoint_wrapper
    pass
```
nit: remove the pass?
```python
fsdp_only_seq = FSDP(deepcopy(seq), cpu_offload=cpu_offload)
# Runs checkpoint-wrapped FSDP
if offload_activations:
    checkpoint_wrapper = offload_wrapper
```
nit: maybe this should just be called `activation_wrapper` or `wrapper` (and change the instances below); otherwise you are shadowing the actual `checkpoint_wrapper` function, which might be confusing since it may be either `checkpoint_wrapper` or `offload_wrapper`.
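One way the rename could look in the test (the variable names `wrapper_to_use` and `wrapped_seq` are just a suggestion, not the PR's final code):

```python
# Select the wrapper under test without shadowing the imported checkpoint_wrapper function.
wrapper_to_use = offload_wrapper if offload_activations else checkpoint_wrapper
wrapped_seq = wrapper_to_use(deepcopy(seq))
```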
```python
    checkpoint_wrapper, offload_to_cpu=offload_activations
)
if offload_activations:
    ckpt_wrapper = offload_wrapper
```
Related to the comment below: if you decide on a common name that encapsulates both `checkpoint_wrapper` and `offload_wrapper`, then you can change the `ckpt_wrapper` here too.
@rohan-varma has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
The `upload test status - Blocked` failure is unrelated. Landing.
`@pytorchbot merge -f "CI failure unrelated"`
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.