-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[FSDP] summon offload to CPU #73904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FSDP] summon offload to CPU #73904
Conversation
Implement ability to offload full params to CPU in summon_full_params. Differential Revision: [D34707801](https://our.internmc.facebook.com/intern/diff/D34707801/) [ghstack-poisoned]
CI Flow Status⚛️ CI FlowRuleset - Version:
|
🔗 Helpful links
💊 CI failures summary and remediationsAs of commit b21c4b2 (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Please report bugs/suggestions to the (internal) Dr. CI Users group. |
Implement ability to offload full params to CPU in summon_full_params. Differential Revision: [D34707801](https://our.internmc.facebook.com/intern/diff/D34707801/) ghstack-source-id: 150769189 Pull Request resolved: #73904
Implement ability to offload full params to CPU in summon_full_params. Differential Revision: [D34707801](https://our.internmc.facebook.com/intern/diff/D34707801/) [ghstack-poisoned]
Pull Request resolved: #73904 Implement ability to offload full params to CPU in summon_full_params. ghstack-source-id: 150794027 Differential Revision: [D34707801](https://our.internmc.facebook.com/intern/diff/D34707801/)
fegin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one comment regarding the warning of the combination of rank_0 + cpu_offload
| offload_to_cpu (bool, optional): If ``True``, full parameters are | ||
| offloaded to CPU. Note that this offloading currently only | ||
| occurs if the parameter is sharded (which is only not the case | ||
| for world_size = 1). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some rare cases, users may want to actually have multiple CPU copies, but we should add a note, warning or check to warn users if offload_to_cpu is True but rank0_only is False. At least users should be fully aware the potential CPU OOM issue that can happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good, added a warning
Implement ability to offload full params to CPU in summon_full_params. Differential Revision: [D34707801](https://our.internmc.facebook.com/intern/diff/D34707801/) [ghstack-poisoned]
Pull Request resolved: #73904 Implement ability to offload full params to CPU in summon_full_params. ghstack-source-id: 150949235 Differential Revision: [D34707801](https://our.internmc.facebook.com/intern/diff/D34707801/)
| ) | ||
| self._update_p_data( | ||
| p, output_tensor=p._full_param_padded, # type: ignore[attr-defined] | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhaojuanmao just want to double check this offloading approach since it's a bit different than regular parameter offload.
Here we have full_param_padded materialized, and p.data points to it. So to offload, we:
- offload full param padded
- Update p.data to point to the offloaded full param padded.
And we restore appropriately when exiting the context.
If we directly offloaded p.data, it would still hold onto GPU memory because of full param padded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, this looks good to me!
zhaojuanmao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left some minor comments, looks great!
in terms of enabling these configs in state_dict() APIs, could we make rank0_only as default setting and cpu_offloading as optional?
| ) | ||
|
|
||
| if writeback: | ||
| self._write_back_current_shard() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just realized when fsdp main API cpu offloading is enabled, if writeback=True, it will make parameters point to local shard in GPUs, not local shards in CPUs. Which looks like a bug? maybe we can restrict writeback for non cpu offloading only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look into this and file an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implement ability to offload full params to CPU in summon_full_params. Differential Revision: [D34707801](https://our.internmc.facebook.com/intern/diff/D34707801/) [ghstack-poisoned]
Pull Request resolved: #73904 Implement ability to offload full params to CPU in summon_full_params. ghstack-source-id: 151311236 Differential Revision: [D34707801](https://our.internmc.facebook.com/intern/diff/D34707801/)
Summary: Pull Request resolved: #73904 Implement ability to offload full params to CPU in summon_full_params. ghstack-source-id: 151311236 Test Plan: CI Reviewed By: fegin Differential Revision: D34707801 fbshipit-source-id: efc9c568037ddfeb9ddaff45a1a0388c6bc85825
|
Hey @rohan-varma. |
Stack from ghstack (oldest at bottom):
Implement ability to offload full params to CPU in summon_full_params.
Differential Revision: D34707801