Support OVERLAP_F_B in schedule #161072
Conversation
prev_actions[rank].add(send)
comm_actions[stage_to_rank(recv.stage_index)].append(recv)
prev_actions[stage_to_rank(recv.stage_index)].add(recv)
for a in all_actions:
Isn't splitting the sub-actions out taking us further from where we want to end up? I would think you'd just plumb overlap_f_b all the way through to the runtime ASAP. It's easy to decompose it in the runtime into F and B actions if there isn't an actual overlap runner available. Then you can keep the IR passes simpler. Wdyt?
(I was mostly referring to appending the sub-action to the schedule, not the comms)
The overlap_f_b action is handled in the runtime by the code added below under _step_microbatches. This section of the code is mostly if-statements for overlap_f_b in the plan, needed so the tests that use _simulate_comms_compute still pass.
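For illustration, here is a minimal, self-contained sketch (not the actual torch.distributed.pipelining code) of the idea being discussed: the runtime keeps the fused OVERLAP_F_B action in the plan and simply runs its sub-actions one by one when no fused overlap runner is available. The `Action` fields and helper names here are assumptions.

```python
# Illustrative sketch only; the real schedule runtime is more involved.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Action:
    computation_type: str                        # e.g. "F", "B", or "OVERLAP_F_B"
    stage_index: int
    microbatch_index: int
    sub_actions: Optional[tuple["Action", ...]] = None

def run_action(action: Action) -> None:
    # Placeholder for the real forward/backward execution.
    print(f"running {action.computation_type} on stage {action.stage_index}")

def step(actions: list[Action], fused_runner_available: bool = False) -> None:
    for action in actions:
        if action.sub_actions is not None and not fused_runner_available:
            # No fused overlap runner: decompose into F and B and run them in order.
            for sub in action.sub_actions:
                run_action(sub)
        else:
            run_action(action)

step([Action("OVERLAP_F_B", 0, 0, sub_actions=(Action("F", 0, 0), Action("B", 0, 0)))])
```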
)
action = compute_actions[rank][0]
# handle new case where parent action (OVERLAP_F_B) can be comprised of subactions
if action is not None and action.sub_actions is not None:
Can we assert action == OVERLAP_F_B here?
In theory a custom user action with sub_actions can also be handled here, so this is more general than just OVERLAP_F_B. Let me update the comment to reflect that.
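A tiny sketch of that distinction (illustrative, not the exact PR code): keying the branch on sub_actions rather than on OVERLAP_F_B specifically means any custom action that carries sub-actions takes the same expansion path.

```python
def has_sub_actions(action) -> bool:
    # True for OVERLAP_F_B as well as for hypothetical custom actions
    # that define their own sub_actions.
    return action is not None and getattr(action, "sub_actions", None) is not None
```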
logger.error(
    "_PipelineScheduleRuntime caught exception at step %s when running action %s. Full Schedule:",
    time_step,
    action,
Do we want to error out on the "sub-action" or the "action"? Another comment is that this try/except seems to make the code harder to read. Since you already made this a function, maybe just do the try/except inside the for loop below? That way you can also control which action you want to print out (sub-action or action).
That's a good point! We want to error out on the parent action, so I will change the try/except to do it in the loop below.
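A minimal sketch of what that looks like, assuming a hypothetical `perform_one_action` callable; the point is only that the try/except sits inside the per-step loop so the log always reports the parent action rather than an individual sub-action.

```python
import logging

logger = logging.getLogger(__name__)

def run_schedule(schedule_for_rank, perform_one_action):
    for time_step, action in enumerate(schedule_for_rank):
        try:
            perform_one_action(action)
        except Exception:
            # Report the parent action so the failing schedule step is easy to locate.
            logger.exception(
                "caught exception at step %s when running action %s",
                time_step,
                action,
            )
            raise
```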
wconstab left a comment:
Overall LGTM, you can land and address my comments later if you want.
# count either full_backward or backward_weight together, to determine when to sync DP grads
backward_counter: Counter[int] = Counter()
for time_step, action in enumerate(self.pipeline_order_with_comms[self.rank]):
    if action.computation_type == OVERLAP_F_B:
We'll ultimately want to move this logic inside perform_action, right?
- We should ensure the profiler tag matches how it will look once the logic is moved, so we don't change how users see it when we move it.
- Even after we support the fused action inside perform_action, should we still have a fallback for running the sub-actions separately (for debugging, or if the model doesn't support the fused runner yet)? If so, it probably makes sense to move the code inside earlier and just not support the fused path yet.
Those are good points. Yeah, it makes sense to move the code inside perform_action. Let me think about a way to organize it better so I don't have to repeat the implementations for F and B just to handle the OVERLAP_F_B case.
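One possible way to organize that, as a rough sketch: the names `_run_forward`, `_run_backward`, and the `overlap_runner` hook are assumptions (not the actual pipelining API), and the action object is assumed to have `computation_type` and `sub_actions` fields like the dataclass sketched earlier. OVERLAP_F_B recurses into the same dispatcher when no fused runner is available, so the F and B implementations live in exactly one place.

```python
def _run_forward(action):
    print(f"forward for stage {action.stage_index}")   # stand-in for the real F step

def _run_backward(action):
    print(f"backward for stage {action.stage_index}")  # stand-in for the real B step

def perform_action(action, overlap_runner=None):
    if action.computation_type == "OVERLAP_F_B":
        if overlap_runner is not None:
            # A fused runner is available: actually overlap F and B.
            overlap_runner(*action.sub_actions)
        else:
            # Fallback: route each sub-action back through the same dispatcher.
            for sub in action.sub_actions:
                perform_action(sub, overlap_runner=overlap_runner)
    elif action.computation_type == "F":
        _run_forward(action)
    elif action.computation_type == "B":
        _run_backward(action)
```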
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Previously, we converted the overlap_f_b action into separate forward and backward operations in the plan. This small change keeps overlap_f_b in the plan and handles it in the runtime.

Pull Request resolved: pytorch#161072
Approved by: https://github.com/fegin, https://github.com/wconstab
Stack from ghstack (oldest at bottom):
Previously, we converted the overlap_f_b action into separate forward and backward operations in the plan. This small change keeps overlap_f_b in the plan and handles it in the runtime.
cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci