Commit 1df24fd
[NCCL] Timeout Loop Thread for Async Error Handling (#41050)
Summary:
Pull Request resolved: #41050
**This Commit:**
We introduce a workVector to track live workNCCL objects corresponding to collective operations. Further, we introduce a workCleanupLoop, which busy-polls the vector of workNCCL objects and removes them upon completion.
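The cleanup behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `FakeWork`, `WorkTracker`, `sweepOnce`, `start`, and `stop` are hypothetical names, and the real workNCCL objects track completion through CUDA events rather than a plain atomic flag.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a workNCCL object (assumption: completion is a flag).
struct FakeWork {
  std::atomic<bool> done{false};
  bool isCompleted() const { return done.load(); }
};

// Hypothetical tracker holding a vector of live work objects.
class WorkTracker {
 public:
  ~WorkTracker() { stop(); }

  // Register a newly launched collective so the cleanup loop can see it.
  void enqueue(std::shared_ptr<FakeWork> work) {
    assert(work != nullptr);
    std::lock_guard<std::mutex> lock(mutex_);
    workVector_.push_back(std::move(work));
  }

  // One pass over the vector: drop every work object that has completed.
  void sweepOnce() {
    std::lock_guard<std::mutex> lock(mutex_);
    workVector_.erase(
        std::remove_if(workVector_.begin(), workVector_.end(),
                       [](const std::shared_ptr<FakeWork>& w) {
                         return w->isCompleted();
                       }),
        workVector_.end());
  }

  std::size_t liveCount() {
    std::lock_guard<std::mutex> lock(mutex_);
    return workVector_.size();
  }

  // Start the background thread that polls the vector until stop() is called.
  void start() {
    terminate_.store(false);
    cleanupThread_ = std::thread([this] {
      while (!terminate_.load()) {
        sweepOnce();
        // Short sleep so the sketch does not spin at 100% CPU.
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
      }
    });
  }

  void stop() {
    terminate_.store(true);
    if (cleanupThread_.joinable()) {
      cleanupThread_.join();
    }
  }

 private:
  std::mutex mutex_;
  std::vector<std::shared_ptr<FakeWork>> workVector_;
  std::atomic<bool> terminate_{false};
  std::thread cleanupThread_;
};
```

In this sketch a caller would construct a `WorkTracker`, call `start()`, and `enqueue` each work object as it is launched. The mutex guards the vector because enqueueing and cleanup happen on different threads.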
**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang while waiting on an unresponsive worker. This stack detects such hangs and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.
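The detect-and-throw idea can be illustrated with a small sketch. Everything here is an assumption for illustration: `TimedWork`, `timedOut`, and `throwIfTimedOut` are hypothetical names, and the actual stack also has to abort the underlying NCCL communicators, which this sketch does not attempt.

```cpp
#include <cassert>
#include <chrono>
#include <stdexcept>
#include <string>

using Clock = std::chrono::steady_clock;

// Hypothetical record of one in-flight collective.
struct TimedWork {
  Clock::time_point start;            // when the collective was launched
  std::chrono::milliseconds timeout;  // how long it is allowed to run
  bool completed;                     // set once the collective finishes
};

// True if the work is still pending and has exceeded its timeout at `now`.
bool timedOut(const TimedWork& w, Clock::time_point now) {
  assert(w.timeout.count() >= 0);
  return !w.completed && (now - w.start) >= w.timeout;
}

// Surface a timeout to the caller as an exception instead of hanging.
void throwIfTimedOut(const TimedWork& w, Clock::time_point now) {
  if (timedOut(w, now)) {
    auto elapsed =
        std::chrono::duration_cast<std::chrono::milliseconds>(now - w.start);
    throw std::runtime_error("collective timed out after " +
                             std::to_string(elapsed.count()) + " ms");
  }
}
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to exercise without real waiting.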
Test Plan: See D22054298 for verification of correctness and performance
Reviewed By: jiayisuse
Differential Revision: D21916637
fbshipit-source-id: f8cadaab0071aaad1c4e31f9b089aa23cba0cfbe
1 parent 15cbd1c commit 1df24fd
2 files changed: +65 -6 lines changed