
Conversation

@kwen2501 (Collaborator) commented Apr 24, 2023

Store based barrier is not scalable.
Experimenting to see if removing it breaks any CI

@pytorch-bot bot commented Apr 24, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99937

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 843b5b4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the `release notes: distributed (c10d)` label on Apr 24, 2023
@kwen2501 changed the title from "[Do Not Land] Remove store barrier after PG init" to "[Experimental] Remove store barrier after PG init" on Apr 25, 2023

@rohan-varma (Contributor) left a comment

Looks like CI is fine, but I think we may still need some sort of barrier? See #49419, #48110 for some prior context. If we don't have a barrier, it might be tough to figure out where the actual synchronization point is before training commences, which is useful for debugging slow or stuck ranks.

@kwen2501 (Collaborator, Author) commented Apr 25, 2023

Thanks @rohan-varma for the quick review :)
That's why this PR is experimental.
But now that all CIs are green, I plan to add an environment variable to activate this removal only when the variable is set.
The removal of the barrier here is preferred for large-scale training, where the store-based barrier is not scalable.

@kumpera (Contributor) left a comment

I guess it's worth trying to remove the barrier and seeing what breaks.

# ranks.
# Update 04/2023: for large-scale runs, this barrier (esp. store-based
# barrier) may be costly and/or unscalable. Also, in a lot of cases,
# these barriers may be unnecessary, as proved by a green CI after

Contributor:

I don't think we can say that just because CI is green it's OK to remove the barrier.
You didn't even run the multi-GPU tests, so the odds of running into the actual issues are slim to none.

Contributor:

Added the periodic-ci label.

# init. Default value is 1 for now to stay the same with previous behavior.
# Users can change it to 0 if such behavior is undesired. We reserve the right
# to change the default value to 0 if small rollout is successful.
_barrier_after_init = int(os.getenv("TORCH_DIST_INIT_BARRIER", "1"))

Contributor:

Could you update the docs as well to surface this new knob?
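
For context on how users would surface and use this knob, a minimal sketch of a job script that opts out of the post-init barrier; the variable name `TORCH_DIST_INIT_BARRIER` comes from this PR, but the rest of the snippet is illustrative, not documentation:

```python
# Hedged sketch: disable the store-based barrier after process group init.
# In the diff above the flag appears to be read at module import time, so the
# safe approach is to set it before torch.distributed is imported -- e.g. here,
# or in the launcher's environment (torchrun, srun, etc.).
import os

os.environ["TORCH_DIST_INIT_BARRIER"] = "0"  # default "1" keeps today's behavior

import torch.distributed as dist

# Assumes the usual rendezvous variables (MASTER_ADDR, MASTER_PORT, RANK,
# WORLD_SIZE) are already provided by the launcher.
dist.init_process_group(backend="nccl")
```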

@kumpera added the `ciflow/trunk` label on Apr 27, 2023
@fduwjj added the `ciflow/periodic` label on Apr 27, 2023

@H-Huang (Member) left a comment

LGTM, we can try it out since the default behavior remains the same. Thanks for adding the flag!

else:
    # Use store based barrier here since barrier() used a bunch of
    # default devices and messes up NCCL internal state.
    _store_based_barrier(global_rank, default_store, timeout)

Member:

I believe this is the only place that _store_based_barrier is used, and it's not a public API, so if we are guarding it here with the _barrier_after_init flag, we also don't need to make any logic changes in the function, right?

@kwen2501 (Collaborator, Author) commented Apr 27, 2023

> we also don't need to make any logic changes in the function, right?

If by "the function" you mean _store_based_barrier, then I believe so.

Comment on lines 3624 to 3828
if _barrier_after_init == 1:
    # barrier at the end to ensure that once we return from this method, all
    # process groups including global variables are updated correctly on all
    # ranks.
    # Update 04/2023: for large-scale runs, these barriers (esp. store-based
    # barrier) may be costly and/or unscalable. Also, in a lot of cases,
    # these barriers may be unnecessary, as proved by a green CI after
    # removal. An environment variable `TORCH_DIST_INIT_BARRIER` has been
    # added which, when set to 0, will disable these barriers.
    if backend == Backend.MPI:
        # MPI doesn't have store.
        barrier()
    else:
        # Use store based barrier here since barrier() used a bunch of
        # default devices and messes up NCCL internal state.
        _store_based_barrier(global_rank, default_store, timeout)
else:
    logger.info(
        "TORCH_DIST_INIT_BARRIER is set to 0, omitting the barrier after "
        "ProcessGroup initialization."
    )

Contributor:

Does this mean we call the barrier twice, because _new_process_group_helper is called in init_process_group?

@kwen2501 (Collaborator, Author):

Yes. Super sad!
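
(To make the "twice" concrete, a toy illustration of the call shape being described; the names mirror the real ones but the bodies are stand-ins, not the actual c10d code.)

```python
# Count how many times the (stubbed) barrier fires when init_process_group
# calls _new_process_group_helper and then barriers again itself.
barrier_calls = 0


def _toy_store_based_barrier():
    global barrier_calls
    barrier_calls += 1


def _toy_new_process_group_helper():
    # ... create the process group ...
    _toy_store_based_barrier()  # the helper barriers after creating the group


def toy_init_process_group():
    _toy_new_process_group_helper()
    _toy_store_based_barrier()  # and init_process_group barriers once more


toy_init_process_group()
print(barrier_calls)  # 2 -- hence "Super sad!"
```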

@kwen2501 force-pushed the remove_pg_init_barrier branch from 918cf83 to 843b5b4 on April 27, 2023 17:12

@kwen2501 (Collaborator, Author):

@pytorchbot merge -f "CI green before a small rebase"

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

@pritamdamania87 (Contributor):

More context about this can also be found in #45181

@kwen2501 (Collaborator, Author):

Thanks for the context, @pritamdamania87!

You mentioned in #45181 that:

> As a result, there is a race that after initializing the process group on say rank 0, if we immediately check the default process group on rank 1 (say via RPC), we might actually get an error since rank 1 hasn't yet updated its _default_pg variable.

I wonder how prevalent such a case is now. Even if it exists, I think this is an issue with RPC -- its programming model implies a need for synchronization between caller and callee. So this is not a problem with init_process_group; RPC may hit the same issue with other calls too. Hence I think the synchronization should be done on the RPC side rather than in c10d (the store-based barrier is just doing RPC a favor here).

The reason for removing the store-based barrier is that it is unfortunately unscalable. Large jobs today fail because of it.

If the previous concern was mainly from RPC, I am actually more confident in making the removal the default, because the described scenario does not seem like something c10d should support.

@pritamdamania87 (Contributor) commented May 12, 2023

> If the previous concern was mainly from RPC, I am actually more confident in making the removal the default, because the described scenario does not seem like something c10d should support.

@kwen2501 I don't think this is only for RPC; it applies to any other mechanism that could inspect the state of the process groups externally. Removing the barrier by default seems dangerous to me and could lead to hard-to-debug issues further down in the program. In terms of a general API contract, it seems sound to have a barrier which ensures all ranks have successfully processed init_process_group before continuing. For example, if you remove the barrier, some ranks could die in the middle of init_process_group while other ranks continue their program. This could lead to weird hangs or other issues further along in the program which are hard to debug.

Regarding the scalability issues, could you elaborate a bit more on what the exact issue is? Typically init_process_group is called only once during the lifetime of the program, and it is usually fine to incur the overhead of a barrier there. Is the use case you are worried about one where we have several short-running programs that need to call init_process_group, or one where a long-running program calls init_process_group multiple times?

@kwen2501 (Collaborator, Author):

Thanks @pritamdamania87. It is not about the overhead of the barrier: the TCPStore can simply error out at that scale.

@pritamdamania87 (Contributor):

> Thanks @pritamdamania87. It is not about the overhead of the barrier: the TCPStore can simply error out at that scale.

@kwen2501 The TCPStore is used during initialization anyway (e.g., to share the ncclUniqueId across all ranks), so I'm not sure how just removing the store-based barrier would help. The fix should really be making the TCPStore scalable.

@wdykas commented May 12, 2023

@pritamdamania87 Just to add another perspective: in our case it's not about erroring out but about the startup time of large jobs. The hope is that by removing the barrier, large jobs will start up faster without having to wait at a barrier.

@pritamdamania87 (Contributor):

@wdykas Thanks for sharing the perspective. I was wondering how long your large jobs run and how much of the startup time comes from the barrier? Typically large jobs run for a long time, so some additional startup time shouldn't hurt much.

That said, I think having a flag to skip the store-based barrier makes sense. But I'm more worried about removing the store-based barrier completely, or making skipping it the default.

pytorchmergebot pushed a commit that referenced this pull request Jun 7, 2023
…3033)

Both internal and OSS users trying #99937 report that their workloads perform normally even with the barrier removed, and they see a scalability win. Thus, in this PR we make it the default that PGs do not perform a barrier after init.

In the discussion of #99937, people pointed out that such a barrier might be needed for c10d + RPC cases. IMO, this need originates from RPC's programming model and should be RPC's or the RPC user's responsibility to handle. That is, it can happen with other functions/libraries too, so the need for c10d to do so big a favor is not justified IMO. It is also good to remove it before users become reliant on this barrier.
Pull Request resolved: #103033
Approved by: https://github.com/XilunWu
pytorchmergebot pushed a commit that referenced this pull request Jun 18, 2023
…list (#103568)

This test (https://github.com/pytorch/pytorch/blob/8340762211e3b55caa178bac748bd902249f6fc0/test/distributed/test_multi_threaded_pg.py#L133) is failing on internal sandbox with the following error msg:
```
  File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/buck-out/v2/gen/fbcode/8c7462494077df89/caffe2/test/distributed/__multi_threaded__/multi_threaded#link-tree/torch/testing/_internal/distributed/multi_threaded_pg.py", line 255, in _start_coll
    raise Exception(
Exception: world not ready, only 3 PG's registered but world has 4 ranks
 exiting thread 1
ERROR
```

Internal error report: https://www.internalfb.com/intern/test/562950031915334?ref_report_id=0

We believe this is because we no longer perform a barrier after init (see #99937).
This PR temporarily turns `TORCH_DIST_INIT_BARRIER` back on to avoid the flaky test for the time being, but we should look into finding a way to do this properly.

cc. @kumpera @kwen2501
Pull Request resolved: #103568
Approved by: https://github.com/H-Huang
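
(A toy repro of the kind of race described in that commit message; this is an assumed simplification using plain Python threads, not the actual multi-threaded PG test.)

```python
import random
import threading
import time

WORLD_SIZE = 4
registered = []  # stands in for the per-thread ProcessGroup registry
lock = threading.Lock()


def worker(rank):
    time.sleep(random.uniform(0, 0.05))  # ranks finish "init" at different times
    with lock:
        registered.append(rank)  # register this rank's PG
    # No barrier after init: immediately start a "collective".
    with lock:
        if len(registered) < WORLD_SIZE:
            print(f"rank {rank}: world not ready, only {len(registered)} PGs "
                  f"registered but world has {WORLD_SIZE} ranks")


threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```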
@github-actions bot deleted the remove_pg_init_barrier branch on October 27, 2024 02:08