Enables configuration of NCCL communicators #97394
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97394
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures as of commit e6fa35b. The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Dependent on this PR: #97407
kwen2501 left a comment:
In general looks good to me! Thanks!
I just have some minor comments. If they are clarified, I think we are good to go.
test/distributed/test_c10d_nccl.py (Outdated)
Tests involving file I/O can sometimes be flaky.
I wonder if merely testing if ProcessGroupNCCL is created successfully would suffice.
I did see usage of tempfile.NamedTemporaryFile() in test_c10d_nccl.py, and that's why I used it here. Is there a different way to do I/O that won't be flaky? I personally wasn't satisfied with testing only whether ProcessGroupNCCL is created. Creation of ProcessGroupNCCL doesn't necessarily mean it was created with the config values a user may specify in the context of this PR, and AFAIK the only way to figure out whether NCCL got those values is through the NCCL_DEBUG file. Maybe let's wait for the 2.17.1 update to be merged, then I'll rebase this PR and we can see whether this test is flaky.
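For reference, a rough sketch of the verification approach described above (a sketch only: the exact process-group construction and the NCCL log strings such as "Max CTAs" and "CGA cluster" are assumptions, not necessarily what the test in this PR does):

```python
import os
import re
import tempfile

import torch
import torch.distributed as dist


def check_nccl_comm_config(store, rank, world_size):
    # Send NCCL's INFO-level log to a temp file so the configured values can
    # be inspected after communicator creation.
    nccl_debug_file = tempfile.NamedTemporaryFile()
    os.environ["NCCL_DEBUG"] = "INFO"
    os.environ["NCCL_DEBUG_FILE"] = nccl_debug_file.name

    # Options plumbing from this PR; attribute names follow the pybind
    # bindings (config.min_ctas, config.max_ctas, config.cga_cluster_size).
    opts = dist.ProcessGroupNCCL.Options()
    opts.config.min_ctas = 2
    opts.config.max_ctas = 4
    opts.config.cga_cluster_size = 2

    pg = dist.ProcessGroupNCCL(store, rank, world_size, opts)
    # Run one collective so the NCCL communicator is actually created.
    pg.allreduce(torch.ones(1).cuda(rank)).wait()

    # NCCL 2.17+ reports the communicator config at INFO level; the exact
    # log wording checked here is an assumption.
    log = nccl_debug_file.read()
    assert re.search(rb"Max CTAs.*4", log) is not None
    assert re.search(rb"CGA cluster.*2", log) is not None
```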
Okay, sounds like a plan.
The latest pipelines test with NCCL 2.17.1. The test added in this PR is not flaky.
ncclConfig_t and ncclCommInitRankConfig were introduced in NCCL 2.14 and are already used by PR #95715 (which checks for 2.14):
https://github.com/pytorch/pytorch/pull/95715/files#diff-8ed049a500c254f133961d941563d701696a3ee8519a14c24f0ba1ef06f13a5eR163-R172
To more clearly distinguish from that PR, maybe we can name the define here more specifically towards CTA & CGA?
That sounds good to me.
Changed naming to NCCL_HAS_COMM_CTA_CGA
torch/nn/parallel/distributed.py (Outdated)
I would appreciate more information on why 2 is recommended.
(The main reason being that it seems to be different from the default values picked by NCCL -- either 0 or 4 depending on SM arch:
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#c.cgaClusterSize)
Since this part touches DDP performance, I wonder if it would be easier for this PR to land if we move it into a separate PR?
These numbers come from internal studies. Specifically, we found that an NCCL CGA cluster size of 2 gives the best performance when NCCL kernels are overlapped with GEMMs, whereas a CGA cluster size of 4 is best in the absence of overlap.
Moving this to a separate PR sounds good to me.
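For illustration, a minimal sketch of how a user could apply the heuristic above through the options exposed in this PR (the attribute names follow the pybind bindings quoted later; treat the exact plumbing as an assumption):

```python
import torch.distributed as dist


def make_nccl_options(overlaps_with_gemms: bool) -> dist.ProcessGroupNCCL.Options:
    # Heuristic from the internal study mentioned above: a CGA cluster size of 2
    # works best when NCCL kernels overlap with GEMMs, 4 when there is no overlap.
    opts = dist.ProcessGroupNCCL.Options()
    opts.config.cga_cluster_size = 2 if overlaps_with_gemms else 4
    return opts
```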
I see, thanks for sharing the information and thanks for agreeing to move it into a separate PR!
I guess a follow-up question for that separate PR would be:
Since the process group is created outside, can we get the expected effect by modifying the CGA setting here?
Are we assuming that the process group hasn't run any collective before being passed to DDP? (hence NCCL communicator hasn't been created)
Removed the DDP-related changes from this PR. I will introduce the default value of 2 for the CGA cluster size and continue the discussion in a new PR.
Force-pushed from 75bfe0d to 635cced.
@pytorchbot label ciflow/inductor ciflow/trunk ciflow/periodic
Force-pushed from e4a2a4f to e6fa35b.
@kwen2501 I believe I have addressed your review. The failing tests don't seem related.
@kwen2501 can we get this approved?
kwen2501 left a comment:
Thanks much for the changes! LGTM!
@pytorchbot merge -f "CI failures does not seem related"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
```cpp
.def_readwrite("cga_cluster_size", &ncclConfig_t::cgaClusterSize)
.def_readwrite("min_ctas", &ncclConfig_t::minCTAs)
.def_readwrite("max_ctas", &ncclConfig_t::maxCTAs)
.def_readwrite("net_name", &ncclConfig_t::netName);
```
There's a problem remaining in the pybind11 interface here. The type of netName is const char *, not std::string; if we simply assign net_name with nccl_options.config.net_name = "Socket", there will be a UnicodeDecodeError, because a Python string is stored into a const char * without any additional workaround.
As in pybind/pybind11#2337, to assign net_name correctly here we should copy the string value into the const char * with a new function; I would like to open a new pull request to fix it. @syed-ahmed @kwen2501
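For illustration, the failure mode being described looks roughly like this (a sketch, assuming the bindings quoted above):

```python
import torch.distributed as dist

opts = dist.ProcessGroupNCCL.Options()
# pybind11 converts the Python str into a temporary char buffer and stores the
# raw pointer into the const char* field; the buffer is freed right after the
# assignment, so later reads of net_name see dangling memory.
opts.config.net_name = "Socket"
print(opts.config.net_name)  # may return garbage or raise UnicodeDecodeError
```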
NCCL 2.17+ introduces user-configurable parameters for NCCL communicators via the ncclConfig_t datatype and ncclCommInitRankConfig. This PR enables that feature.
A user can tune the parameters as follows:
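Something along these lines (a sketch; the config attribute names follow this PR's bindings, while keyword names such as pg_options are assumptions based on the existing c10d API):

```python
import torch.distributed as dist

nccl_options = dist.ProcessGroupNCCL.Options()
nccl_options.config.cga_cluster_size = 2   # CGA cluster size used by NCCL kernels
nccl_options.config.min_ctas = 4           # lower bound on CTAs per NCCL kernel
nccl_options.config.max_ctas = 16          # upper bound on CTAs per NCCL kernel

# Pass the options when creating the default group or a sub-group.
dist.init_process_group("nccl", pg_options=nccl_options)
subgroup = dist.new_group(ranks=[0, 1], pg_options=nccl_options)
```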
The default values of these parameters are what is initialized by NCCL_CONFIG_INITIALIZER. Only for DistributedDataParallel, this PR sets the default value of cga_cluster_size to 2 (a heuristic that works well, especially for DDP workloads). Tuning these parameters can lead to improvements in end-to-end performance, since they affect the communication-computation overlap for NCCL kernels.
CC: @ptrblck @kwen2501