-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[FSDP] Fix summon_full_params test #74456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This test did not actually do any CPU offloading. Adding it revealed an issue in no_shard (currently only when world_size == 1) case which is tracked for a fix in #74166 Differential Revision: [D35003793](https://our.internmc.facebook.com/intern/diff/D35003793/) [ghstack-poisoned]
CI Flow Status⚛️ CI FlowRuleset - Version:
|
🔗 Helpful links
💊 CI failures summary and remediationsAs of commit e86140c (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Please report bugs/suggestions to the (internal) Dr. CI Users group. |
This test did not actually do any CPU offloading. Adding it revealed an issue in no_shard (currently only when world_size == 1) case which is tracked for a fix in #74166 Differential Revision: [D35003793](https://our.internmc.facebook.com/intern/diff/D35003793/) [ghstack-poisoned]
This test did not actually do any CPU offloading. Adding it revealed an issue in no_shard (currently only when world_size == 1) case which is tracked for a fix in #74166 Differential Revision: [D35003793](https://our.internmc.facebook.com/intern/diff/D35003793/) [ghstack-poisoned]
This test did not actually do any CPU offloading. Adding it revealed an issue in no_shard (currently only when world_size == 1) case which is tracked for a fix in #74166 Differential Revision: [D35003793](https://our.internmc.facebook.com/intern/diff/D35003793/) [ghstack-poisoned]
This test did not actually do any CPU offloading. Adding it revealed an issue in no_shard (currently only when world_size == 1) case which is tracked for a fix in #74166 Differential Revision: [D35003793](https://our.internmc.facebook.com/intern/diff/D35003793/) [ghstack-poisoned]
| lin1 = FSDP(nn.Linear(5, 5, bias=False).cuda(cls.rank), cpu_offload=cpu_offload) | ||
| lin2 = nn.Linear(5, 3, bias=False).cuda(cls.rank) | ||
| model = FSDP(nn.Sequential(lin1, lin2), cpu_offload=cpu_offload) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious, what's the expectation on the input model when cpu_offload==True? Does it have to reside on GPU? Or is it OK to pass in a CPU model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Passing CPU offload with CPU model should work, as we don't assume any device in the constructor and would restore to GPU in forward pass before all_gather. Although, it does not seem that we have a test for this, we should add one.
| lin2 = nn.Linear(5, 3, bias=False).cuda(cls.rank) | ||
| model = FSDP(nn.Sequential(lin1, lin2), cpu_offload=cpu_offload) | ||
| if not cpu_offload: | ||
| model = model.cuda(cls.rank) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this? I assume when cpu_offload==False, the model is already on cuda, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, we can remove this because lin1, lin2 are already on CUDA device.
This test did not actually do any CPU offloading. Adding it revealed an issue in no_shard (currently only when world_size == 1) case which is tracked for a fix in #74166 Differential Revision: [D35003793](https://our.internmc.facebook.com/intern/diff/D35003793/) [ghstack-poisoned]
Summary: Pull Request resolved: #74456 This test did not actually do any CPU offloading. Adding it revealed an issue in no_shard (currently only when world_size == 1) case which is tracked for a fix in #74166 ghstack-source-id: 152057449 Test Plan: CI Reviewed By: zhaojuanmao Differential Revision: D35003793 fbshipit-source-id: ffb5f8ea897cda84584ae83d362bea3c0c407c3b
|
Hey @rohan-varma. |
Summary: Pull Request resolved: #74456 This test did not actually do any CPU offloading. Adding it revealed an issue in no_shard (currently only when world_size == 1) case which is tracked for a fix in #74166 ghstack-source-id: 152057449 Test Plan: CI Reviewed By: zhaojuanmao Differential Revision: D35003793 fbshipit-source-id: ffb5f8ea897cda84584ae83d362bea3c0c407c3b (cherry picked from commit 06afeba)
Stack from ghstack (oldest at bottom):
This test did not actually do any CPU offloading. Adding it revealed
an issue in no_shard (currently only when world_size == 1) case which is
tracked for a fix in #74166
Differential Revision: D35003793