
Support OVERLAP_F_B in schedule #161072

Closed
H-Huang wants to merge 6 commits into gh/H-Huang/212/base from gh/H-Huang/212/head

Conversation

@H-Huang
Member

@H-Huang H-Huang commented Aug 20, 2025

Stack from ghstack (oldest at bottom):

Previously, we converted the overlap_f_b action into separate forward and backward operations in the plan. This small change keeps overlap_f_b as a single action in the plan and handles it in the runtime.

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci
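
To make the shape of that change concrete, here is a minimal sketch of a plan entry that keeps OVERLAP_F_B as a single parent action carrying its forward/backward sub-actions. The `Action` and `ComputationType` classes below are simplified illustrations, not the actual torch.distributed.pipelining types; only the idea of a `sub_actions` field comes from this PR.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class ComputationType(Enum):
    FORWARD = auto()
    FULL_BACKWARD = auto()
    OVERLAP_F_B = auto()


@dataclass(frozen=True)
class Action:
    stage_index: int
    computation_type: ComputationType
    microbatch_index: int
    # A parent action (e.g. OVERLAP_F_B) keeps its components instead of being
    # split into separate FORWARD/FULL_BACKWARD entries in the plan.
    sub_actions: Optional[tuple["Action", ...]] = None


# One plan entry that overlaps the forward of microbatch 3 with the backward of microbatch 1.
overlap = Action(
    stage_index=0,
    computation_type=ComputationType.OVERLAP_F_B,
    microbatch_index=3,
    sub_actions=(
        Action(0, ComputationType.FORWARD, 3),
        Action(0, ComputationType.FULL_BACKWARD, 1),
    ),
)
```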

@pytorch-bot

pytorch-bot bot commented Aug 20, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161072

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 11 Cancelled Jobs

As of commit cad8b07 with merge base 6737e2c:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the "oncall: distributed" label (Add this issue/PR to distributed oncall triage queue) Aug 20, 2025
H-Huang added a commit that referenced this pull request Aug 20, 2025
ghstack-source-id: 431cc28
Pull Request resolved: #161072
@H-Huang H-Huang added the "release notes: distributed (pipeline)" label (release notes category) Aug 20, 2025
H-Huang added a commit that referenced this pull request Sep 2, 2025
ghstack-source-id: fd657e2
Pull Request resolved: #161072
@H-Huang H-Huang marked this pull request as ready for review September 2, 2025 22:25
prev_actions[rank].add(send)
comm_actions[stage_to_rank(recv.stage_index)].append(recv)
prev_actions[stage_to_rank(recv.stage_index)].add(recv)
for a in all_actions:
Contributor

Isn't splitting the sub-actions out taking us further from where we want to end up? I would think you'd just plumb overlap_f_b all the way through to the runtime ASAP. It's easy to decompose it in the runtime into F and B actions if there isn't an actual overlap runner available. Then you can keep the IR passes simpler. Wdyt?

(I was mostly referring to appending the sub-action to the schedule, not the comms)

Member Author

The overlap_f_b is handled in the runtime by the code added below under _step_microbatches. This section of the code is mostly if-statements for overlap_f_b in the plan, in order to pass the tests that use _simulate_comms_compute.
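
For readers following along, a rough sketch of the runtime side described here: a _step_microbatches-style loop that either hands an OVERLAP_F_B action to a fused runner or falls back to executing its sub-actions one by one (the decomposition the reviewer suggests keeping in the runtime). The loop shape and the runner callables are assumptions, reusing the illustrative `Action`/`ComputationType` types from the sketch above, not the actual implementation.

```python
def step_microbatches_sketch(plan, run_forward, run_backward, run_overlapped=None):
    # `plan` holds Action objects as in the earlier illustration; the three
    # runner callables stand in for the real stage execution helpers.
    for action in plan:
        if action.computation_type == ComputationType.OVERLAP_F_B:
            if run_overlapped is not None:
                # Fused path: an overlap-capable runner executes F and B together.
                run_overlapped(action.sub_actions)
            else:
                # Fallback: decompose in the runtime rather than in the IR passes.
                for sub in action.sub_actions:
                    if sub.computation_type == ComputationType.FORWARD:
                        run_forward(sub)
                    else:
                        run_backward(sub)
        elif action.computation_type == ComputationType.FORWARD:
            run_forward(action)
        else:
            run_backward(action)
```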

@H-Huang H-Huang requested review from fegin and kwen2501 September 8, 2025 21:38
)
action = compute_actions[rank][0]
# handle new case where parent action (OVERLAP_F_B) can be comprised of subactions
if action is not None and action.sub_actions is not None:
Contributor

Can we assert action == OVERLAP_F_B here?

Member Author

In theory, a user's custom action with sub_actions can also be handled here, so this is more general than just OVERLAP_F_B. Let me update the comment to reflect this.
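
A one-function sketch of the generality described here: the runtime branches on the presence of sub_actions, not on the OVERLAP_F_B type, so a hypothetical user-defined composite action is expanded the same way. The function name is invented for illustration.

```python
def expand_if_composite(action):
    # Applies equally to OVERLAP_F_B and to any custom parent action that
    # carries sub_actions; plain actions pass through unchanged.
    if action.sub_actions is not None:
        return list(action.sub_actions)
    return [action]
```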

logger.error(
    "_PipelineScheduleRuntime caught exception at step %s when running action %s. Full Schedule:",
    time_step,
    action,
Contributor

Do we want to error out on the "sub-action" or the "action"? Another comment: this try/except seems to make the code harder to read. Since you already made this a function, maybe just do the try/except inside the for loop below? This way you can also control which action you want to print out (sub-action or action).

Member Author

That's a good point! We want to error out on the action; I will change the try/except to do it in the loop below.
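
A sketch of the restructuring agreed on above: the try/except moves inside the per-action loop so the error log names exactly the action (or sub-action) that failed. The helper names and the shortened log message are illustrative, not the PR's code.

```python
import logging

logger = logging.getLogger(__name__)


def run_actions(actions, time_step, perform_one):
    for action in actions:
        try:
            perform_one(action)
        except Exception:
            # Logging here (rather than around the whole step) pinpoints the
            # failing action before re-raising.
            logger.error(
                "_PipelineScheduleRuntime caught exception at step %s when running action %s.",
                time_step,
                action,
            )
            raise
```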

Contributor

@wconstab wconstab left a comment

Overall LGTM, you can land and address my comment later if you want.

# count either full_backward or backward_weight together, to determine when to sync DP grads
backward_counter: Counter[int] = Counter()
for time_step, action in enumerate(self.pipeline_order_with_comms[self.rank]):
    if action.computation_type == OVERLAP_F_B:
Contributor

We'll ultimately want to move this logic inside perform_action, right?

  1. we should ensure the profiler tag matches how it would look once it's moved, so we don't change how users see it when we move it
  2. even after we support the fused action inside perform_action, should we still have a fallback for running the sub-actions separately (for debugging, or if the model doesn't support the fused runner yet)? If so, it probably makes sense to move the code inside earlier and just not support the fused path yet?

Member Author

Those are good points. Yeah, it makes sense to move the code inside perform_action. Let me think about a way to organize it better so I don't have to repeat the implementations for F and B just to handle the OVERLAP_F_B case.
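
A sketch of how both review points could be satisfied once the logic moves into perform_action: the fused path and the sub-action fallback share one profiler tag (so traces do not change when the code moves), and the fallback reuses a single-action runner instead of duplicating the F and B implementations. The tag format and the callable arguments are assumptions; only torch.profiler.record_function is a real API, and the Action fields follow the earlier illustrative sketch.

```python
from torch.profiler import record_function


def run_overlap_f_b(action, fused_runner, run_sub_action):
    # One stable tag for both execution paths.
    tag = f"OVERLAP_F_B.stage_{action.stage_index}.mb_{action.microbatch_index}"
    with record_function(tag):
        if fused_runner is not None:
            fused_runner(action.sub_actions)
        else:
            # Debug/fallback path: run the sub-actions separately through the
            # existing single-action runner, so F and B are not reimplemented here.
            for sub in action.sub_actions:
                run_sub_action(sub)
```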

@H-Huang H-Huang added the "ciflow/trunk" label (Trigger trunk jobs on your pull request) Oct 7, 2025
@H-Huang
Member Author

H-Huang commented Oct 7, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
Pull Request resolved: pytorch#161072
Approved by: https://github.com/fegin, https://github.com/wconstab
@github-actions github-actions bot deleted the gh/H-Huang/212/head branch November 7, 2025 02:16