
Conversation

@kwen2501 (Collaborator) commented Apr 24, 2023

Store based barrier is not scalable.
Experimenting to see if removing it breaks any CI

@pytorch-bot bot commented Apr 24, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99937

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 843b5b4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the `release notes: distributed (c10d)` label on Apr 24, 2023
@kwen2501 changed the title from "[Do Not Land] Remove store barrier after PG init" to "[Experimental] Remove store barrier after PG init" on Apr 25, 2023

@rohan-varma (Contributor) left a comment

Looks like CI is fine, but I think we may still need some sort of barrier? See #49419, #48110 for some prior context. If we don't have a barrier, it might be tough to figure out where the actual synchronization point is before training commences, which is useful for debugging slow or stuck ranks.

@kwen2501 (Collaborator, Author) commented Apr 25, 2023

Thanks @rohan-varma for the quick review :)
That's why this PR is experimental.
But now that all CIs are green, I plan to add an environment variable to activate this removal only when the variable is set.
The removal of the barrier here is preferred for large-scale training, where the store-based barrier is not scalable.

@kumpera (Contributor) left a comment

I guess it's worth trying to remove the barrier and seeing what breaks.

# ranks.
# Update 04/2023: for large-scale runs, this barrier (esp. store-based
# barrier) may be costly and/or unscalable. Also, in a lot of cases,
# these barriers may be unnecessary, as proved by a green CI after

Contributor:

I don't think we can say that just because CI is green it's OK to remove the barrier.
You didn't even run the multi-GPU tests, so the odds of running into the actual issues are slim to none.

Contributor:

Added the periodic-ci label.

# init. Default value is 1 for now to stay the same with previous behavior.
# Users can change it to 0 if such behavior is undesired. We reserve the right
# to change the default value to 0 if small rollout is successful.
_barrier_after_init = int(os.getenv("TORCH_DIST_INIT_BARRIER", "1"))

Contributor:

Could you update the docs as well to surface this new knob?
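
For context on how users would surface and use this knob, a minimal sketch of a job script that opts out of the post-init barrier; the variable name `TORCH_DIST_INIT_BARRIER` comes from this PR, but the rest of the snippet is illustrative, not documentation:

```python
# Hedged sketch: disable the store-based barrier after process group init.
# In the diff above the flag appears to be read at module import time, so the
# safe approach is to set it before torch.distributed is imported -- e.g. here,
# or in the launcher's environment (torchrun, srun, etc.).
import os

os.environ["TORCH_DIST_INIT_BARRIER"] = "0"  # default "1" keeps today's behavior

import torch.distributed as dist

# Assumes the usual rendezvous variables (MASTER_ADDR, MASTER_PORT, RANK,
# WORLD_SIZE) are already provided by the launcher.
dist.init_process_group(backend="nccl")
```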

@kumpera added the `ciflow/trunk` label on Apr 27, 2023
@fduwjj added the `ciflow/periodic` label on Apr 27, 2023

@H-Huang (Member) left a comment

LGTM, we can try it out since the default behavior remains the same. Thanks for adding the flag!

else:
    # Use store based barrier here since barrier() used a bunch of
    # default devices and messes up NCCL internal state.
    _store_based_barrier(global_rank, default_store, timeout)

Member:

I believe this is the only place that _store_based_barrier is used, and it's not a public API, so if we are guarding it here with the _barrier_after_init flag, we also don't need to make any logic changes in the function, right?

@kwen2501 (Collaborator, Author) commented Apr 27, 2023

> we also don't need to make any logic changes in the function, right?

If by "the function" you mean _store_based_barrier, then I believe so.

Comment on lines 3624 to 3828
if _barrier_after_init == 1:
    # barrier at the end to ensure that once we return from this method, all
    # process groups including global variables are updated correctly on all
    # ranks.
    # Update 04/2023: for large-scale runs, these barriers (esp. store-based
    # barrier) may be costly and/or unscalable. Also, in a lot of cases,
    # these barriers may be unnecessary, as proved by a green CI after
    # removal. An environment variable `TORCH_DIST_INIT_BARRIER` has been
    # added which, when set to 0, will disable these barriers.
    if backend == Backend.MPI:
        # MPI doesn't have store.
        barrier()
    else:
        # Use store based barrier here since barrier() used a bunch of
        # default devices and messes up NCCL internal state.
        _store_based_barrier(global_rank, default_store, timeout)
else:
    logger.info(
        "TORCH_DIST_INIT_BARRIER is set to 0, omitting the barrier after "
        "ProcessGroup initialization."
    )

Contributor:

Does this mean we call the barrier twice, because _new_process_group_helper is called in init_process_group?

@kwen2501 (Collaborator, Author):

Yes. Super sad!
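
(To make the "twice" concrete, a toy illustration of the call shape being described; the names mirror the real ones but the bodies are stand-ins, not the actual c10d code.)

```python
# Count how many times the (stubbed) barrier fires when init_process_group
# calls _new_process_group_helper and then barriers again itself.
barrier_calls = 0


def _toy_store_based_barrier():
    global barrier_calls
    barrier_calls += 1


def _toy_new_process_group_helper():
    # ... create the process group ...
    _toy_store_based_barrier()  # the helper barriers after creating the group


def toy_init_process_group():
    _toy_new_process_group_helper()
    _toy_store_based_barrier()  # and init_process_group barriers once more


toy_init_process_group()
print(barrier_calls)  # 2 -- hence "Super sad!"
```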

@kwen2501 force-pushed the remove_pg_init_barrier branch from 918cf83 to 843b5b4 on April 27, 2023 17:12

@kwen2501 (Collaborator, Author):

@pytorchbot merge -f "CI green before a small rebase"

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

@pritamdamania87 (Contributor):

More context about this can also be found in #45181

@kwen2501 (Collaborator, Author):

Thanks for the context, @pritamdamania87!

You mentioned in #45181 that:

> As a result, there is a race that after initializing the process group on say rank 0, if we immediately check the default process group on rank 1 (say via RPC), we might actually get an error since rank 1 hasn't yet updated its _default_pg variable.

I wonder how prevalent such a case is now. Even if it exists, I think this is an issue with RPC -- its programming model implies a need for synchronization between caller and callee. So this is not a problem with init_process_group; RPC may hit the same issue with other calls too. Hence I think the synchronization should be done on the RPC side rather than in c10d (the store-based barrier is just doing RPC a favor here).

The reason for removing the store-based barrier is that it is unfortunately unscalable. Large jobs today fail because of it.

If the previous concern was mainly from RPC, I am actually more confident in making the removal the default, because the described scenario does not seem like something c10d should support.

@pritamdamania87 (Contributor) commented May 12, 2023

> If the previous concern was mainly from RPC, I am actually more confident in making the removal the default, because the described scenario does not seem like something c10d should support.

@kwen2501 I don't think this is only for RPC; it applies to any other mechanism that could inspect the state of the process groups externally. Removing the barrier by default seems dangerous to me and could lead to hard-to-debug issues further down in the program. In terms of a general API contract, it seems sound to have a barrier which ensures all ranks have successfully processed init_process_group before continuing. For example, if you remove the barrier, some ranks could die in the middle of init_process_group while other ranks continue their program. This could lead to weird hangs or other issues further along in the program which are hard to debug.

Regarding the scalability issues, could you elaborate a bit more on what the exact issue is? Typically init_process_group is called only once during the lifetime of the program, and it is usually fine to incur the overhead of a barrier there. Is the use case you are worried about one where we have several short-running programs that need to call init_process_group, or one where a long-running program calls init_process_group multiple times?

@kwen2501 (Collaborator, Author):

Thanks @pritamdamania87. It is not about the overhead of the barrier: the TCPStore can simply error out at that scale.

@pritamdamania87 (Contributor):

> Thanks @pritamdamania87. It is not about the overhead of the barrier: the TCPStore can simply error out at that scale.

@kwen2501 The TCPStore is used during initialization anyway (e.g., to share the ncclUniqueId across all ranks), so I'm not sure how just removing the store-based barrier would help. The fix should really be making the TCPStore scalable.

@wdykas commented May 12, 2023

@pritamdamania87 Just to add another perspective: in our case it's not about erroring out but about the startup time of large jobs. The hope is that by removing the barrier, large jobs will start up faster without having to wait at a barrier.

@pritamdamania87 (Contributor):

@wdykas Thanks for sharing the perspective. I was wondering how long your large jobs run and how much of the startup time comes from the barrier? Typically large jobs run for a long time, so some additional startup time shouldn't hurt much.

That said, I think having a flag to skip the store-based barrier makes sense. But I'm more worried about removing the store-based barrier completely, or making skipping it the default.

pytorchmergebot pushed a commit that referenced this pull request Jun 7, 2023
…3033)

Both internal and OSS users trying #99937 report that their workloads perform normally even with the barrier removed, and they see a scalability win. Thus, in this PR we make it the default that PGs do not perform a barrier after init.

In the discussion of #99937, people pointed out that such a barrier might be needed for c10d + RPC cases. IMO, this need originates from RPC's programming model and should be RPC's or the RPC user's responsibility to handle. That is, it can happen with other functions/libraries too, so the need for c10d to do so big a favor is not justified IMO. It is also good to remove it before users become reliant on this barrier.
Pull Request resolved: #103033
Approved by: https://github.com/XilunWu
pytorchmergebot pushed a commit that referenced this pull request Jun 18, 2023
…list (#103568)

This test (https://github.com/pytorch/pytorch/blob/8340762211e3b55caa178bac748bd902249f6fc0/test/distributed/test_multi_threaded_pg.py#L133) is failing on internal sandbox with the following error msg:
```
  File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/buck-out/v2/gen/fbcode/8c7462494077df89/caffe2/test/distributed/__multi_threaded__/multi_threaded#link-tree/torch/testing/_internal/distributed/multi_threaded_pg.py", line 255, in _start_coll
    raise Exception(
Exception: world not ready, only 3 PG's registered but world has 4 ranks
 exiting thread 1
ERROR
```

Internal error report: https://www.internalfb.com/intern/test/562950031915334?ref_report_id=0

We believe this is because we no longer perform a barrier after init (see #99937).
This PR temporarily turns `TORCH_DIST_INIT_BARRIER` back on to avoid the flaky test for the time being, but we should look into finding a way to do this properly.

cc. @kumpera @kwen2501
Pull Request resolved: #103568
Approved by: https://github.com/H-Huang
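
(A toy repro of the kind of race described in that commit message; this is an assumed simplification using plain Python threads, not the actual multi-threaded PG test.)

```python
import random
import threading
import time

WORLD_SIZE = 4
registered = []  # stands in for the per-thread ProcessGroup registry
lock = threading.Lock()


def worker(rank):
    time.sleep(random.uniform(0, 0.05))  # ranks finish "init" at different times
    with lock:
        registered.append(rank)  # register this rank's PG
    # No barrier after init: immediately start a "collective".
    with lock:
        if len(registered) < WORLD_SIZE:
            print(f"rank {rank}: world not ready, only {len(registered)} PGs "
                  f"registered but world has {WORLD_SIZE} ranks")


threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```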
@github-actions bot deleted the remove_pg_init_barrier branch on October 27, 2024 02:08