Commit 572ff27
[RESUBMIT] Ensure ncclCommAbort can abort stuck ncclCommInitRank (#103925)
#95715 made it possible to abort `ncclCommInitRankConfig` by setting `blocking=0`, which enables non-blocking behavior.
However, calling `pg._abort()` didn't recover from a stuck `ncclCommInitRankConfig`, because the `_abort` method only iterated over the `devNCCLCommMap_` map and aborted the communicators stored there. Since `ncclCommInitRankConfig` was stuck, the communicator itself was never added to the map, and the host thread stayed blocked on this line: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1171. As a result, `_abort` was a no-op.
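The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not PyTorch code: `FakeComm`, `BuggyProcessGroup`, `begin_init`, and `finish_init` are hypothetical names, and `dev_comm_map` only mirrors the role that `devNCCLCommMap_` plays in the description.

```python
class FakeComm:
    """Hypothetical stand-in for an NCCL communicator (not a real type)."""

    def __init__(self, name):
        self.name = name
        self.aborted = False

    def abort(self):
        self.aborted = True


class BuggyProcessGroup:
    """Toy model: only fully initialized communicators are tracked."""

    def __init__(self):
        # Plays the role of devNCCLCommMap_ in the commit message.
        self.dev_comm_map = {}

    def begin_init(self, key):
        # The communicator exists, but it is registered only after
        # initialization finishes. A hung init never gets that far.
        return FakeComm(key)

    def finish_init(self, key, comm):
        self.dev_comm_map[key] = comm

    def abort(self):
        # Walks the registered communicators only.
        aborted = []
        for key, comm in self.dev_comm_map.items():
            comm.abort()
            aborted.append(key)
        return aborted


pg = BuggyProcessGroup()
stuck = pg.begin_init("cuda:0")  # init "hangs": finish_init is never called
aborted_keys = pg.abort()
print(aborted_keys, stuck.aborted)
```

Because the stuck communicator never reaches the map, `abort()` has nothing to iterate over and the communicator is left untouched.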
To resolve this issue, I now add each communicator to `inProgressCommMap_` as soon as it is created and remove it from that map once it has been added to `devNCCLCommMap_`.
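The bookkeeping behind this fix can be sketched with a toy model. This is an illustration under stated assumptions, not the code in ProcessGroupNCCL.cpp: `FakeComm`, `FixedProcessGroup`, `begin_init`, and `finish_init` are hypothetical names, the two dictionaries only mirror the roles of `inProgressCommMap_` and `devNCCLCommMap_`, and the lock is an assumption about how concurrent access might be guarded.

```python
import threading


class FakeComm:
    """Hypothetical stand-in for an NCCL communicator (not a real type)."""

    def __init__(self, name):
        self.name = name
        self.aborted = False

    def abort(self):
        self.aborted = True


class FixedProcessGroup:
    """Toy model: communicators are tracked from the moment they are created."""

    def __init__(self):
        self._lock = threading.Lock()      # assumption: maps are lock-guarded
        self.in_progress_comm_map = {}     # role of inProgressCommMap_
        self.dev_comm_map = {}             # role of devNCCLCommMap_

    def begin_init(self, key):
        comm = FakeComm(key)
        with self._lock:
            # Register immediately, before initialization can hang.
            self.in_progress_comm_map[key] = comm
        return comm

    def finish_init(self, key):
        with self._lock:
            # Add to the ready map first, then drop the in-progress entry.
            comm = self.in_progress_comm_map[key]
            self.dev_comm_map[key] = comm
            del self.in_progress_comm_map[key]
        return comm

    def abort(self):
        with self._lock:
            comms = list(self.dev_comm_map.values())
            comms += list(self.in_progress_comm_map.values())
        for comm in comms:
            comm.abort()
        return sorted(comm.name for comm in comms)


pg = FixedProcessGroup()
ready = pg.begin_init("cuda:0")
pg.finish_init("cuda:0")         # this communicator finished initializing
stuck = pg.begin_init("cuda:1")  # this one "hangs" before finish_init
aborted_keys = pg.abort()
print(aborted_keys)
```

In this model the abort path reaches both the ready communicator and the one still initializing, which is the property the fix is meant to provide.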
I also added a unit test that fails without the changes to ProcessGroupNCCL.cpp.
Pull Request resolved: #103925
Approved by: https://github.com/osalpekar
1 parent b76a040, commit 572ff27
File tree
3 files changed: +76, -7 lines
- torch
  - csrc/distributed/c10d
  - testing/_internal/distributed