[Experimental] Remove store barrier after PG init #99937

Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99937
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 843b5b4. This comment was automatically generated by Dr. CI and updates every 15 minutes.
rohan-varma left a comment
Thanks @rohan-varma for the quick review :)
kumpera left a comment
I guess it's worth trying to remove the barrier and seeing what breaks.
```python
# ranks.
# Update 04/2023: for large-scale runs, this barrier (esp. store-based
# barrier) may be costly and/or unscalable. Also, in a lot of cases,
# these barriers may be unnecessary, as proved by a green CI after
```
I don't think we can say that just because CI is green it's OK to remove the barrier.
You didn't even run the multi-GPU tests, so the odds of running into the actual issues are slim to none.
Added the periodic-ci label.
```python
# init. Default value is 1 for now to stay the same with previous behavior.
# Users can change it to 0 if such behavior is undesired. We reserve the right
# to change the default value to 0 if small rollout is successful.
_barrier_after_init = int(os.getenv("TORCH_DIST_INIT_BARRIER", "1"))
```
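A usage sketch (not from the PR itself): since `_barrier_after_init` appears to be read when the c10d module is imported, the knob has to be set in the environment before `torch.distributed` is imported, e.g. in the launcher environment or at the very top of the script. The single-process values below are placeholders.

```python
import os

# Opt out of the post-init barrier. Set this before torch.distributed is
# imported (or export it in the launcher environment), because the knob is
# read at module import time in this diff.
os.environ["TORCH_DIST_INIT_BARRIER"] = "0"

import torch.distributed as dist

# Placeholder single-process setup; rank/world_size normally come from the
# launcher (e.g. torchrun).
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
)
# init_process_group now returns without running the trailing store-based barrier.
```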
Could you update the docs as well to surface this new knob?
H-Huang left a comment
LGTM. We can try it out since the default behavior remains the same; thanks for adding the flag.
```python
else:
    # Use store based barrier here since barrier() used a bunch of
    # default devices and messes up NCCL internal state.
    _store_based_barrier(global_rank, default_store, timeout)
```
I believe this is the only place where _store_based_barrier is used, and it's not a public API, so if we are guarding it here with the _barrier_after_init flag, we also don't need to make any logic changes in the function, right?
> we also don't need to make any logic changes in the function right?

If by "the function" you mean _store_based_barrier, then I believe so.
```python
if _barrier_after_init == 1:
    # barrier at the end to ensure that once we return from this method, all
    # process groups including global variables are updated correctly on all
    # ranks.
    # Update 04/2023: for large-scale runs, these barriers (esp. store-based
    # barrier) may be costly and/or unscalable. Also, in a lot of cases,
    # these barriers may be unnecessary, as proved by a green CI after
    # removal. An environment variable `TORCH_DIST_INIT_BARRIER` has been
    # added which, when set to 0, will disable these barriers.
    if backend == Backend.MPI:
        # MPI doesn't have store.
        barrier()
    else:
        # Use store based barrier here since barrier() used a bunch of
        # default devices and messes up NCCL internal state.
        _store_based_barrier(global_rank, default_store, timeout)
else:
    logger.info(
        "TORCH_DIST_INIT_BARRIER is set to 0, omitting the barrier after "
        "ProcessGroup initialization."
    )
```
Does this mean we call barrier twice because _new_process_group_helper is called in init_process_group?
Yes. Super sad!
Force-pushed from 918cf83 to 843b5b4.
@pytorchbot merge -f "CI green before a small rebase"

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
More context about this can also be found in #45181.
Thanks for the context @pritamdamania87! You mentioned in #45181 that:

I wonder how prevalent such a case is now. Even if it exists, I think this is an issue of RPC -- its programming model implies a need for synchronization between caller and callee. So this is not a problem of c10d.

The reason for removing the store-based barrier is that it is unfortunately unscalable. Large jobs today fail because of it. If the previous concern was mainly from RPC, I am actually more confident in making the removal the default, because the described scenario does not seem like something c10d should support.
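To make the alternative concrete: an application that does depend on every rank having finished init before proceeding (e.g. before issuing RPCs into peers) could add its own synchronization point instead of relying on the implicit one. A minimal sketch, assuming TORCH_DIST_INIT_BARRIER=0 and a CPU-friendly backend; note the comment in the diff that a plain collective barrier can interact badly with NCCL device selection, which is why c10d used a store-based barrier internally.

```python
import torch.distributed as dist


def init_then_sync(backend: str = "gloo") -> None:
    """Initialize the default process group, then synchronize explicitly.

    Restores the "all ranks are ready" guarantee at the application level
    when the implicit post-init barrier is disabled. Illustration only; not
    code from this PR.
    """
    dist.init_process_group(backend=backend)
    # All ranks block here until everyone has created the default group.
    dist.barrier()
```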
@kwen2501 I don't think this is only for RPC but in general about any other mechanism that could inspect the state of the process groups externally. Removing the barrier by default seems dangerous to me and could lead to hard-to-debug issues further down in the program. In terms of a general API contract, it seems sound to have a barrier which ensures all ranks have successfully processed init_process_group.

Regarding scalability issues, could you elaborate a bit more on what the exact issue is?
Thanks @pritamdamania87. It is not about the overhead of the barrier. The TCPStore can just error out at that large scale.
@kwen2501 The TCPStore is used during initialization anyway (e.g., sharing the ncclUniqueId across all ranks), so I'm not sure how just removing the store-based barrier would help. The fix should really be making the TCPStore scalable.
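To illustrate the point being made here: the same store that would back the barrier is already the rendezvous mechanism for `init_process_group`, so every rank talks to it during init regardless. A sketch with an explicitly constructed store; host, port, and the single-rank values are placeholders, not from this PR.

```python
from datetime import timedelta

import torch.distributed as dist

# Ranks already rely on a store during init: rendezvous and exchange of
# backend setup info (e.g. the NCCL unique id) go through it, whether or not
# the trailing store-based barrier runs afterwards.
store = dist.TCPStore("127.0.0.1", 29502, world_size=1, is_master=True,
                      timeout=timedelta(seconds=60))
dist.init_process_group(backend="gloo", store=store, rank=0, world_size=1)
```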
@pritamdamania87 Just to add another perspective: in our case it's not about erroring out but about the start-up time of large jobs. The hope is that by removing the barrier, large jobs will start up faster without having to hit a barrier.
@wdykas Thanks for sharing the perspective. I was wondering how long your large jobs run and how much of the start time comes from the barrier? Typically large jobs run for a long time, so some additional startup time shouldn't hurt much? That said, I think having a flag to skip the store-based barrier makes sense. But I'm more worried about removing the store-based barrier completely or making skipping it the default.
…3033) Both internal and OSS users trying #99937 report that their workloads perform normally even with the barrier removed, and see a scalability win. Thus in this PR, we make it the default that PGs do not perform a barrier after init. In the discussion of #99937, people point out that such a barrier might be needed for c10d + RPC cases. IMO, this need originates from RPC's programming model and should be RPC's or the RPC user's responsibility to deal with. That is, it can happen with other functions/libraries too, so the need for c10d to do so big a favor is not justified IMO. It is also good to remove it before users become reliant on this barrier. Pull Request resolved: #103033 Approved by: https://github.com/XilunWu
…list (#103568) This test (https://github.com/pytorch/pytorch/blob/8340762211e3b55caa178bac748bd902249f6fc0/test/distributed/test_multi_threaded_pg.py#L133) is failing on internal sandbox with the following error message:
```
File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/buck-out/v2/gen/fbcode/8c7462494077df89/caffe2/test/distributed/__multi_threaded__/multi_threaded#link-tree/torch/testing/_internal/distributed/multi_threaded_pg.py", line 255, in _start_coll
    raise Exception(
Exception: world not ready, only 3 PG's registered but world has 4 ranks exiting thread 1 ERROR
```
Internal error report: https://www.internalfb.com/intern/test/562950031915334?ref_report_id=0
We believe this is because we no longer perform a barrier after init (see #99937). This PR temporarily turns `TORCH_DIST_INIT_BARRIER` back on to avoid the flaky test for the time being, but we should look into finding a way to do this properly. cc. @kumpera @kwen2501 Pull Request resolved: #103568 Approved by: https://github.com/H-Huang
The store-based barrier is not scalable.
Experimenting to see if removing it breaks any CI.