
In pipeline parallelism: Use same dtype for receive and send tensor when initializing p2p communication. #165539

Closed

BlueCrescent wants to merge 2 commits into pytorch:main from BlueCrescent:fix_pp_use_same_dtype_when_initializing_p2p_communication

Conversation

BlueCrescent (Contributor) commented Oct 15, 2025

When initializing the p2p communication for pipeline parallelism, different default dtypes are currently used for the send and receive tensors here (https://github.com/pytorch/pytorch/blob/5c583e2573f29243742e00b9fa36b266c5c78bb3/torch/distributed/pipelining/stage.py#L935-L936):

recv_tensor = torch.zeros(1, device=self.device)                  # float32 by default
send_tensor = torch.tensor(self.stage_index, device=self.device)  # int64 inferred from the Python int

This caused hard-to-trace issues when training on multiple nodes. Multiple stages on one node seem to work for some reason, which is probably why the unit tests did not catch this.
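
Below is a minimal sketch of the mismatch and of one possible fix (pinning both tensors to a shared dtype). The standalone device and stage_index variables stand in for self.device and self.stage_index, and the choice of torch.int64 is an assumption for illustration, not necessarily the dtype the patch settles on:

import torch

device = torch.device("cpu")  # stands in for self.device
stage_index = 0               # stands in for self.stage_index

# Before: the two tensors silently end up with different dtypes.
recv_tensor = torch.zeros(1, device=device)             # float32 by default
send_tensor = torch.tensor(stage_index, device=device)  # int64 inferred from the Python int
assert recv_tensor.dtype != send_tensor.dtype

# After: both ends of the handshake agree on a single dtype.
recv_tensor = torch.zeros(1, device=device, dtype=torch.int64)
send_tensor = torch.tensor(stage_index, device=device, dtype=torch.int64)
assert recv_tensor.dtype == send_tensor.dtype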

Fixes #165143

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

pytorch-bot bot commented Oct 15, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165539

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 1 Pending, 1 Unrelated Failure

As of commit 5974b80 with merge base 5c583e2:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the oncall: distributed label Oct 15, 2025
H-Huang added the release notes: distributed (pipeline) and ciflow/trunk labels Oct 15, 2025
H-Huang (Member) left a comment

Thank you for root causing and fixing. Send/recv with mismatched dtypes indeed causes undefined behavior, which is likely why we didn't see this in our runs. Really appreciate your investigation!
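
As a rough illustration of why the dtypes have to match, here is a minimal sketch assuming a plain two-rank torch.distributed setup (not code from this PR; the helper name stage_index_handshake is made up for this sketch): each endpoint preallocates its own buffer, the send/recv pair just fills that buffer, and nothing on the wire corrects a disagreement, so a mismatch can surface as hangs, corrupted values, or errors like the EOFError reported in #165143.

import torch
import torch.distributed as dist

def stage_index_handshake(rank: int, peer: int, device: torch.device) -> None:
    # Assumes dist.init_process_group(...) has already been called on both ranks.
    # Both ranks must agree on dtype and shape up front; the transport does not
    # check this for us, which is exactly the failure mode of the original bug.
    if rank < peer:
        send_tensor = torch.full((1,), rank, device=device, dtype=torch.int64)
        dist.send(send_tensor, dst=peer)
    else:
        recv_tensor = torch.zeros(1, device=device, dtype=torch.int64)
        dist.recv(recv_tensor, src=peer)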

H-Huang (Member) commented Oct 15, 2025

@pytorchbot merge

pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
…hen initializing p2p communication. (pytorch#165539)

Pull Request resolved: pytorch#165539
Approved by: https://github.com/H-Huang

Labels

ciflow/trunk, Merged, oncall: distributed, open source, release notes: distributed (pipeline)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pipeline Parallelism Across Nodes Fails with EOFError

4 participants