
Conversation

@rohan-varma
Contributor

@rohan-varma rohan-varma commented Sep 22, 2022

Stack from ghstack (oldest at bottom):

Passing in `offload_to_cpu=True` to `checkpoint_wrapper` is a bit confusing, because it causes the activation checkpointing args to be ignored and we do CPU offloading instead. This isn't ideal from an API design perspective, so I propose making `offload_wrapper` its own concept.

Now offload to CPU + checkpoint can be composed together, such as:

```
# apply AC to transformer layers
apply_ac_wrapper(model, checkpoint_wrapper, check_fn=lambda mod: isinstance(mod, TransformerLayer))
# offload the rest of activations to CPU
model = offload_wrapper(model)
```
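For concreteness, here is a minimal sketch of the proposed composition on a toy model. The wrapper names (`apply_ac_wrapper`, `checkpoint_wrapper`, `offload_wrapper`) follow this proposal and their final import path may differ; `Block` is a hypothetical stand-in for a transformer layer:

```
import torch.nn as nn

# Hypothetical stand-in for a transformer layer.
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(32, 32), nn.ReLU())

    def forward(self, x):
        return self.ff(x)

model = nn.Sequential(Block(), Block(), nn.Linear(32, 10))

# 1) Recompute (checkpoint) activations inside each Block.
apply_ac_wrapper(model, checkpoint_wrapper, check_fn=lambda mod: isinstance(mod, Block))
# 2) Offload the remaining (non-checkpointed) activations to CPU.
model = offload_wrapper(model)
```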

Will polish / add tests if this proposal sounds good.

Differential Revision: [D39719854](https://our.internmc.facebook.com/intern/diff/D39719854/)

@pytorch-bot

pytorch-bot bot commented Sep 22, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/85459

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 59d1dc3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the oncall: distributed label Sep 22, 2022
rohan-varma added a commit that referenced this pull request Sep 22, 2022
@vadimkantorov
Contributor

Related: #70166

Also, if autocast existed as an explicit wrapper, these problems would be easier to understand: #85340 #45910
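For illustration, a sketch of what such an explicit autocast wrapper could look like; this is not an existing PyTorch API, only the module-wrapper shape the comment suggests (`torch.autocast` itself is real):

```
import torch
import torch.nn as nn

class AutocastWrapper(nn.Module):
    """Hypothetical wrapper that runs the wrapped module under autocast."""

    def __init__(self, mod, device_type="cuda", dtype=torch.float16):
        super().__init__()
        self.mod = mod
        self.device_type = device_type
        self.dtype = dtype

    def forward(self, *args, **kwargs):
        # Only the wrapper class is hypothetical; torch.autocast is the
        # standard context manager.
        with torch.autocast(device_type=self.device_type, dtype=self.dtype):
            return self.mod(*args, **kwargs)
```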

rohan-varma added a commit that referenced this pull request Sep 22, 2022
```
        super().__init__(mod)

    def forward(self, *args, **kwargs):
        with save_on_cpu(pin_memory=True):
```
Collaborator

Tangentially related:

> You should not over-allocate pinned memory. Doing so can reduce overall system performance because it reduces the amount of physical memory available to the operating system and other programs. How much is too much is difficult to tell in advance, so as with all optimizations, test your applications and the systems they run on for optimal performance parameters.

https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/

For activation offload, we operate under the assumption that we have unlimited memory to pin. I think it is worthwhile to investigate the robustness of this assumption.

Without any offloading, the total activation size is upper bounded by the GPU memory size, but with offloading, that can be increased significantly (e.g. <= the number of activations times the GPU memory size). For some workloads that activation offloading targets (e.g. large batch size and activations; small model), I imagine that the total activation size can be larger than the GPU memory size. This is exacerbated if CPU offloading is simultaneously enabled.

That being said, if the CPU memory size is sufficiently large and not used for much else during program execution, then pinning significant amounts of CPU memory may not be a problem. I think we should simply try to get some better intuition here.
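To make the footprint concrete, here is a minimal sketch (assuming a CUDA device) of the `save_on_cpu(pin_memory=True)` behavior used in this wrapper: every tensor saved for backward inside the context is copied to pinned host memory, so the pinned allocation grows with the total activation size of the wrapped region:

```
import torch
from torch.autograd.graph import save_on_cpu

lin = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(512, 1024, device="cuda", requires_grad=True)

# Each activation saved for backward inside this context lives in pinned CPU
# memory until backward; with many or large activations this pinned footprint
# can exceed what the OS can comfortably lock.
with save_on_cpu(pin_memory=True):
    loss = lin(lin(x)).sum()
loss.backward()
```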

Contributor Author

This makes sense. I filed an issue: #86097. In the medium term, if we determine we shouldn't always pin memory, we might make this flag configurable; in the long term it could possibly be autotuned.

Collaborator

@awgu awgu left a comment

Left some nits.

I think renaming to apply_activation_wrapper() in the short term is still reasonable. That being said, I am curious:

  1. What is the cost of renaming it now? The previous PR renamed it, so it seems okay to rename.
  2. What are the relevant future plans/discussions that may affect the naming?

```
        )

    def forward(self, *args, **kwargs):
        raise ValueError("Implement me")
```
Collaborator

nit:

Suggested change
```diff
-        raise ValueError("Implement me")
+        raise NotImplementedError("Subclasses should implement this")
```
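For context, a short sketch (the class name and attribute are stand-ins, not the PR's actual code) of why `NotImplementedError` is the conventional choice for an abstract `forward`:

```
import torch.nn as nn

class ActivationWrapperSketch(nn.Module):
    """Base wrapper whose subclasses must override forward."""

    def __init__(self, mod):
        super().__init__()
        self._wrapped = mod

    def forward(self, *args, **kwargs):
        # NotImplementedError marks a deliberately abstract method, whereas
        # ValueError would suggest the arguments themselves were invalid.
        raise NotImplementedError("Subclasses should implement this")
```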


```
class CheckpointWrapper(ActivationWrapper):
    """
    An nn.Module that wraps another nn.Module with checkpointing. Note that this
```
Collaborator

nit: Do we want these in code type?

Suggested change
```diff
-    An nn.Module that wraps another nn.Module with checkpointing. Note that this
+    An ``nn.Module`` that wraps another ``nn.Module`` with checkpointing. Note that this
```

```
        *checkpoint_fn_args,
        **checkpoint_fn_kwargs,
    ):
        super().__init__()
```
Collaborator

The lint error seems real?

Suggested change
```diff
-        super().__init__()
+        super().__init__(mod)
```

@facebook-github-bot
Contributor

/easycla

As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details.

This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.

@rohan-varma rohan-varma requested a review from kwen2501 as a code owner October 6, 2022 23:58
rohan-varma added a commit that referenced this pull request Oct 6, 2022
@rohan-varma
Contributor Author

Discussed with @awgu: we'll punt the renaming to the future, when the API is more finalized and a functional approach is being taken.

rohan-varma added a commit that referenced this pull request Oct 11, 2022
@rohan-varma rohan-varma requested a review from awgu October 11, 2022 06:04
Collaborator

@awgu awgu left a comment

LGTM!

```
    ckpt_wrapper = offload_wrapper
else:
    ckpt_wrapper = checkpoint_wrapper
pass
```
Collaborator

nit: remove the pass?

```
fsdp_only_seq = FSDP(deepcopy(seq), cpu_offload=cpu_offload)
# Runs checkpoint-wrapped FSDP
if offload_activations:
    checkpoint_wrapper = offload_wrapper
```
Collaborator

nit: maybe this should just be called `activation_wrapper` or `wrapper` (and change the instances below); otherwise you are shadowing the actual `checkpoint_wrapper` function, which might be confusing since it may be either `checkpoint_wrapper` or `offload_wrapper`.
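A small sketch of the suggested rename (the local name `wrapper` is hypothetical), which keeps the imported `checkpoint_wrapper` accessible:

```
# Bind the chosen wrapper to a distinct local name instead of reusing
# `checkpoint_wrapper`, so the imported function is not shadowed.
wrapper = offload_wrapper if offload_activations else checkpoint_wrapper
wrapped_seq = wrapper(deepcopy(seq))
```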

```
    checkpoint_wrapper, offload_to_cpu=offload_activations
)
if offload_activations:
    ckpt_wrapper = offload_wrapper
```
Collaborator

Related to the comment below: if you decide on a common name that encapsulates both `checkpoint_wrapper` and `offload_wrapper`, then you can change `ckpt_wrapper` here too.

@pytorch-bot bot added the ciflow/trunk label Oct 11, 2022
rohan-varma added a commit that referenced this pull request Oct 14, 2022
@rohan-varma
Contributor Author

@rohan-varma has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@rohan-varma
Contributor Author

The "upload test status - Blocked" failure is unrelated. Landing.

@rohan-varma
Contributor Author

@pytorchbot merge -f "CI failure unrelated"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@github-actions
Contributor

Hey @rohan-varma.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/596/head branch June 8, 2023 18:35

Labels

ciflow/trunk · cla signed · Merged · oncall: distributed · release notes: distributed (fsdp)
