
In pipeline parallelism: Use same dtype for receive and send tensor when initializing p2p communication. #165539

Closed

BlueCrescent wants to merge 2 commits into pytorch:main from BlueCrescent:fix_pp_use_same_dtype_when_initializing_p2p_communication

Conversation

BlueCrescent (Contributor) commented Oct 15, 2025

When initializing the p2p communication for pipeline parallelism, different default dtypes are currently used for the send and receive tensors here (https://github.com/pytorch/pytorch/blob/5c583e2573f29243742e00b9fa36b266c5c78bb3/torch/distributed/pipelining/stage.py#L935-L936):

recv_tensor = torch.zeros(1, device=self.device)                  # float32 by default
send_tensor = torch.tensor(self.stage_index, device=self.device)  # int64 inferred from the Python int

This caused hard-to-trace issues when training on multiple nodes. Multiple stages on one node seem to work for some reason, which is probably why the unit tests did not catch this.
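
Below is a minimal sketch of the mismatch and of one possible fix (pinning both tensors to a shared dtype). The standalone device and stage_index variables stand in for self.device and self.stage_index, and the choice of torch.int64 is an assumption for illustration, not necessarily the dtype the patch settles on:

import torch

device = torch.device("cpu")  # stands in for self.device
stage_index = 0               # stands in for self.stage_index

# Before: the two tensors silently end up with different dtypes.
recv_tensor = torch.zeros(1, device=device)             # float32 by default
send_tensor = torch.tensor(stage_index, device=device)  # int64 inferred from the Python int
assert recv_tensor.dtype != send_tensor.dtype

# After: both ends of the handshake agree on a single dtype.
recv_tensor = torch.zeros(1, device=device, dtype=torch.int64)
send_tensor = torch.tensor(stage_index, device=device, dtype=torch.int64)
assert recv_tensor.dtype == send_tensor.dtype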

Fixes #165143

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

pytorch-bot bot commented Oct 15, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165539

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 1 Pending, 1 Unrelated Failure

As of commit 5974b80 with merge base 5c583e2:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the oncall: distributed label Oct 15, 2025
H-Huang added the release notes: distributed (pipeline) and ciflow/trunk labels Oct 15, 2025
H-Huang (Member) left a comment

Thank you for root causing and fixing. Send/recv with mismatched dtypes indeed causes undefined behavior, which is likely why we didn't see this in our runs. Really appreciate your investigation!
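
As a rough illustration of why the dtypes have to match, here is a minimal sketch assuming a plain two-rank torch.distributed setup (not code from this PR; the helper name stage_index_handshake is made up for this sketch): each endpoint preallocates its own buffer, the send/recv pair just fills that buffer, and nothing on the wire corrects a disagreement, so a mismatch can surface as hangs, corrupted values, or errors like the EOFError reported in #165143.

import torch
import torch.distributed as dist

def stage_index_handshake(rank: int, peer: int, device: torch.device) -> None:
    # Assumes dist.init_process_group(...) has already been called on both ranks.
    # Both ranks must agree on dtype and shape up front; the transport does not
    # check this for us, which is exactly the failure mode of the original bug.
    if rank < peer:
        send_tensor = torch.full((1,), rank, device=device, dtype=torch.int64)
        dist.send(send_tensor, dst=peer)
    else:
        recv_tensor = torch.zeros(1, device=device, dtype=torch.int64)
        dist.recv(recv_tensor, src=peer)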

H-Huang (Member) commented Oct 15, 2025

@pytorchbot merge

pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
…hen initializing p2p communication. (pytorch#165539)

Pull Request resolved: pytorch#165539
Approved by: https://github.com/H-Huang

Labels

ciflow/trunk, Merged, oncall: distributed, open source, release notes: distributed (pipeline)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pipeline Parallelism Across Nodes Fails with EOFError

4 participants