[BugFix] Fix pipeline parallel #24621
Conversation
Signed-off-by: Nick Hill <nhill@redhat.com>
Code Review
This pull request introduces several bug fixes related to pipeline parallelism and shutdown procedures. The changes remove an incorrect assertion that caused failures in pipeline parallel setups, add a defensive check to prevent errors during interpreter shutdown, and implement a proper shutdown method for the UniProcExecutor. These fixes improve the robustness and correctness of distributed execution, and they address the described issues effectively.
```python
def shutdown(self) -> None:
    if worker := self.driver_worker:
        worker.shutdown()
```
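The `if worker := self.driver_worker:` guard means shutdown is forwarded only when a driver worker actually exists, so a partially constructed executor is a safe no-op. A minimal standalone sketch of that pattern (the `FakeWorker` and `UniExecutor` classes here are hypothetical stand-ins, not the PR's code):

```python
# Sketch of the walrus-operator guard used in UniProcExecutor.shutdown.
# All names below are illustrative, not taken from vLLM.

class FakeWorker:
    def __init__(self) -> None:
        self.shut_down = False

    def shutdown(self) -> None:
        self.shut_down = True


class UniExecutor:
    def __init__(self, worker=None) -> None:
        self.driver_worker = worker

    def shutdown(self) -> None:
        # Walrus assignment binds and tests in one expression: if the
        # executor never received a worker (None), shutdown is a no-op.
        if worker := self.driver_worker:
            worker.shutdown()


ex = UniExecutor(FakeWorker())
ex.shutdown()           # forwards to the worker

empty = UniExecutor()   # never got a worker
empty.shutdown()        # safe no-op, no AttributeError
```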
The other changes fix `None` dereference errors raised during interpreter shutdown, when objects have already been garbage collected; this was a minor issue introduced with #22699.
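During interpreter shutdown, attributes or module globals a finalizer relies on may already have been cleared, so dereferencing them raises errors. As a hedged illustration of the defensive pattern described above (the `Worker` class and its `resource` attribute are hypothetical, not the PR's code):

```python
# Sketch of a defensive guard against None dereference during teardown:
# the attribute may already have been cleared by earlier cleanup or by
# interpreter shutdown. Names are illustrative, not taken from vLLM.

class Worker:
    def __init__(self) -> None:
        self.resource = object()

    def shutdown(self) -> None:
        # getattr with a default tolerates both a missing attribute and
        # one already set to None, making repeated teardown safe.
        resource = getattr(self, "resource", None)
        if resource is not None:
            self.resource = None  # release exactly once


w = Worker()
w.shutdown()
w.shutdown()  # second call is a safe no-op
```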
Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Pipeline parallelism was broken by an incorrect `assert` introduced in #24265. The breakage was originally obscured by the other NCCL-related failures in the 4-GPU distributed CI tests. This fixes the 4-GPU distributed CI test.
Edit: It looks like the separate pipeline parallel test was still passing; I'm not sure what the difference is, but the torchrun version of it was failing due to this.