[RFC] Separate CPU offload activation to its own wrapper #85459
Conversation
Passing `offload_to_cpu=True` to `checkpoint_wrapper` is a bit confusing, because it causes the activation checkpoint args to be ignored and we do CPU offloading instead. This isn't ideal from an API design perspective, so this PR proposes making `offload_wrapper` its own concept. Offload to CPU and checkpointing can then be composed together, such as:

```
# apply AC to transformer layers
apply_ac_wrapper(model, checkpoint_wrapper, check_fn=lambda mod: isinstance(mod, TransformerLayer))
# offload the rest of activations to CPU
model = offload_wrapper(model)
```

Will polish / add tests if this proposal sounds good.

Differential Revision: [D39719854](https://our.internmc.facebook.com/intern/diff/D39719854/)
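For concreteness, here is a minimal end-to-end sketch of the proposed composition. The helper names (`apply_activation_checkpointing`, `checkpoint_wrapper`, `offload_wrapper`) reflect the `torch.distributed.algorithms._checkpoint.checkpoint_wrapper` module as it exists today, and the toy `TransformerLayer` is invented for the example; treat this as an illustration rather than the PR's exact API.

```python
import torch
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,  # the PR text refers to this as apply_ac_wrapper
    checkpoint_wrapper,
    offload_wrapper,
)

class TransformerLayer(nn.Module):
    """Toy stand-in for a real transformer block."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = nn.Sequential(*[TransformerLayer() for _ in range(4)])

# Recompute (checkpoint) activations only for the transformer layers...
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=checkpoint_wrapper,
    check_fn=lambda mod: isinstance(mod, TransformerLayer),
)
# ...and offload the remaining activations of the whole model to CPU.
model = offload_wrapper(model)

loss = model(torch.randn(2, 128)).sum()
loss.backward()
```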
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/85459. Note: links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 59d1dc3. This comment was automatically generated by Dr. CI and updates every 15 minutes.
```python
super().__init__(mod)

def forward(self, *args, **kwargs):
    with save_on_cpu(pin_memory=True):
```
Tangentially related:
> You should not over-allocate pinned memory. Doing so can reduce overall system performance because it reduces the amount of physical memory available to the operating system and other programs. How much is too much is difficult to tell in advance, so as with all optimizations, test your applications and the systems they run on for optimal performance parameters.

https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
For activation offload, we operate under the assumption that we have unlimited memory to pin. I think it is worthwhile to investigate the robustness of this assumption.
Without any offloading, the total activation size is upper bounded by the GPU memory size, but with offloading, that can be increased significantly (e.g. <= the number of activations times the GPU memory size). For some workloads that activation offloading targets (e.g. large batch size and activations; small model), I imagine that the total activation size can be larger than the GPU memory size. This is exacerbated if CPU offloading is simultaneously enabled.
That being said, if the CPU memory size is sufficiently large and not used for much else during program execution, then pinning significant amounts of CPU memory may not be a problem. I think we should simply try to get some better intuition here.
This makes sense. I filed an issue: #86097. In the medium term, if we determine we shouldn't always pin memory, we could make this flag configurable; in the long term it could possibly be autotuned.
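As a rough illustration of the configurable-flag idea, here is a minimal sketch. The class name and the `pin_memory` constructor argument are hypothetical and not part of this PR; only `torch.autograd.graph.save_on_cpu` is existing API.

```python
import torch
from torch.autograd.graph import save_on_cpu

class ConfigurableOffloadWrapper(torch.nn.Module):
    """Hypothetical offload wrapper whose pinned-memory behavior is configurable."""

    def __init__(self, mod: torch.nn.Module, pin_memory: bool = True):
        super().__init__()
        self.mod = mod
        self.pin_memory = pin_memory

    def forward(self, *args, **kwargs):
        # save_on_cpu moves tensors saved for backward to CPU during forward
        # and copies them back to their original device when backward runs.
        with save_on_cpu(pin_memory=self.pin_memory):
            return self.mod(*args, **kwargs)
```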
awgu left a comment:
Left some nits.
I think renaming to `apply_activation_wrapper()` in the short term is still reasonable. That being said, I am curious:
- What is the cost to renaming it now? The previous PR renamed it, so it seems to be okay to rename.
- What are the relevant future plans/discussions that may affect the naming?
```python
)

def forward(self, *args, **kwargs):
    raise ValueError("Implement me")
```
nit:

```diff
- raise ValueError("Implement me")
+ raise NotImplementedError("Subclasses should implement this")
```
```python
class CheckpointWrapper(ActivationWrapper):
    """
    An nn.Module that wraps another nn.Module with checkpointing. Note that this
```
nit: Do we want these in code type?
```diff
- An nn.Module that wraps another nn.Module with checkpointing. Note that this
+ An ``nn.Module`` that wraps another ``nn.Module`` with checkpointing. Note that this
```
```python
    *checkpoint_fn_args,
    **checkpoint_fn_kwargs,
):
    super().__init__()
```
The lint error seems real?
```diff
- super().__init__()
+ super().__init__(mod)
```
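For context on the classes these comments touch, here is a minimal sketch of the wrapper hierarchy this PR introduces. The `_checkpoint_wrapped_module` attribute name, the non-reentrant `torch.utils.checkpoint.checkpoint` call, and the simplified constructors are assumptions for illustration; the actual diff carries more options (e.g. custom checkpoint functions and args).

```python
import torch.nn as nn
from torch.autograd.graph import save_on_cpu
from torch.utils.checkpoint import checkpoint

class ActivationWrapper(nn.Module):
    """Shared base: holds the wrapped module; subclasses decide how forward runs."""

    def __init__(self, mod: nn.Module):
        super().__init__()
        self._checkpoint_wrapped_module = mod

    def forward(self, *args, **kwargs):
        raise NotImplementedError("Subclasses should implement this")

class OffloadWrapper(ActivationWrapper):
    """Offloads activations saved for backward to CPU (what offload_wrapper returns)."""

    def forward(self, *args, **kwargs):
        with save_on_cpu(pin_memory=True):
            return self._checkpoint_wrapped_module(*args, **kwargs)

class CheckpointWrapper(ActivationWrapper):
    """Recomputes activations in backward instead of saving them."""

    def forward(self, *args, **kwargs):
        return checkpoint(
            self._checkpoint_wrapped_module, *args, use_reentrant=False, **kwargs
        )
```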
Discussed with @awgu: we'll punt the renaming to the future, when the API is more finalized and a functional approach is being taken.
awgu left a comment:
LGTM!
```python
    ckpt_wrapper = offload_wrapper
else:
    ckpt_wrapper = checkpoint_wrapper
    pass
```
nit: remove the pass?
```python
fsdp_only_seq = FSDP(deepcopy(seq), cpu_offload=cpu_offload)
# Runs checkpoint-wrapped FSDP
if offload_activations:
    checkpoint_wrapper = offload_wrapper
```
nit: maybe this should just be called `activation_wrapper` or `wrapper` (and change the instances below); otherwise you are shadowing the actual `checkpoint_wrapper` function, which might be confusing since it may be either `checkpoint_wrapper` or `offload_wrapper`.
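One way the rename could look in the test (the variable names `wrapper_to_use` and `wrapped_seq` are just a suggestion, not the PR's final code):

```python
# Select the wrapper under test without shadowing the imported checkpoint_wrapper function.
wrapper_to_use = offload_wrapper if offload_activations else checkpoint_wrapper
wrapped_seq = wrapper_to_use(deepcopy(seq))
```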
```python
    checkpoint_wrapper, offload_to_cpu=offload_activations
)
if offload_activations:
    ckpt_wrapper = offload_wrapper
```
Related to the comment below: if you decide on a common name that encapsulates both `checkpoint_wrapper` and `offload_wrapper`, then you can change the `ckpt_wrapper` here too.
@rohan-varma has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
The `upload test status - Blocked` failure is unrelated. Landing.
`@pytorchbot merge -f "CI failure unrelated"`
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.