
Conversation

@syed-ahmed
Collaborator

NCCL 2.17+ introduces user-configurable parameters for NCCL communicators via the ncclConfig_t datatype and ncclCommInitRankConfig. This PR enables that feature.

A user can tune the parameters as follows:

import torch.distributed as dist

nccl_options = dist.ProcessGroupNCCL.Options()
nccl_options.config.max_ctas = 32         # maximum number of CTAs NCCL may use per kernel
nccl_options.config.min_ctas = 8          # minimum number of CTAs NCCL may use per kernel
nccl_options.config.cga_cluster_size = 2  # CGA (thread-block cluster) size for NCCL kernels

# Options can be applied to the default process group and/or to sub-groups.
dist.init_process_group(backend='nccl', init_method='env://', pg_options=nccl_options)
my_group = dist.new_group(pg_options=nccl_options)

The default values of these parameters are those set by NCCL_CONFIG_INITIALIZER. For DistributedDataParallel only, this PR sets the default value of cga_cluster_size to 2, a heuristic that works particularly well for DDP workloads.

Tuning these parameters can improve end-to-end performance, since they affect the communication-computation overlap of NCCL kernels.
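
As context for reviewers, the sketch below shows roughly how these options map onto NCCL's communicator-creation API. ncclConfig_t, NCCL_CONFIG_INITIALIZER, and ncclCommInitRankConfig are the NCCL interfaces themselves; the helper function, hard-coded values, and omitted error handling are illustrative only and are not the PR's actual code.

```cpp
#include <nccl.h>

// Illustrative sketch: forward user-provided config fields into NCCL's
// communicator creation (error checking omitted for brevity).
ncclComm_t createCommWithConfig(int nRanks, int rank, ncclUniqueId id) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;  // NCCL-provided defaults
  config.maxCTAs = 32;         // mirrors nccl_options.config.max_ctas
  config.minCTAs = 8;          // mirrors nccl_options.config.min_ctas
  config.cgaClusterSize = 2;   // mirrors nccl_options.config.cga_cluster_size
  ncclComm_t comm = nullptr;
  // Used in place of ncclCommInitRank() when a config struct is supplied.
  ncclCommInitRankConfig(&comm, nRanks, id, rank, &config);
  return comm;
}
```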

CC: @ptrblck @kwen2501

@pytorch-bot

pytorch-bot bot commented Mar 22, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97394

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit e6fa35b:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) label Mar 22, 2023
@syed-ahmed
Collaborator Author

syed-ahmed commented Mar 22, 2023

This would need a container with NCCL 2.17 for the upstream CI to trigger unit tests. I'm currently looking at updating pytorch builder.

@syed-ahmed
Collaborator Author

Dependent on this PR: #97407

@ngimel ngimel added the triaged label Mar 24, 2023
Collaborator

@kwen2501 kwen2501 left a comment

In general looks good to me! Thanks!
I just have some minor comments. If they are clarified, I think we are good to go.

Collaborator

Tests involving file I/O can sometimes be flaky.
I wonder if merely testing whether ProcessGroupNCCL is created successfully would suffice.

Collaborator Author

I did see usage of tempfile.NamedTemporaryFile() in test_c10d_nccl.py, and that's why I used it here. Is there a different way to do I/O that won't be flaky? I personally wasn't satisfied testing only whether ProcessGroupNCCL is created. Creation of ProcessGroupNCCL doesn't necessarily mean it was created with the config values a user may specify in the context of this PR, and AFAIK the only way to verify that NCCL received those values is through the NCCL_DEBUG file. Maybe let's wait for the 2.17.1 update to be merged in; I'll rebase this PR and we can see whether this test is flaky.

Collaborator

Okay, sounds like a plan.

Collaborator Author

The latest pipelines test with NCCL 2.17.1. The test added in this PR is not flaky.

Collaborator

ncclConfig_t and ncclCommInitRankConfig were introduced in NCCL 2.14 and are already used by this earlier PR (which checks for 2.14):
https://github.com/pytorch/pytorch/pull/95715/files#diff-8ed049a500c254f133961d941563d701696a3ee8519a14c24f0ba1ef06f13a5eR163-R172

To distinguish this more clearly from that PR, maybe we can name the define here more specifically after CTA & CGA?

Collaborator Author

That sounds good to me.

Collaborator Author

Changed the name to NCCL_HAS_COMM_CTA_CGA.
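
For reference, the version gate looks roughly like the sketch below (illustrative, not the exact diff); NCCL_MAJOR and NCCL_MINOR come from nccl.h, and NCCL_HAS_COMM_CTA_CGA is the name adopted here.

```cpp
// Sketch: expose the CTA/CGA config fields only when building against
// NCCL >= 2.17, independently of the 2.14 check that guards
// ncclCommInitRankConfig itself.
#if defined(NCCL_MAJOR) && defined(NCCL_MINOR) && \
    ((NCCL_MAJOR > 2) || ((NCCL_MAJOR == 2) && (NCCL_MINOR >= 17)))
#define NCCL_HAS_COMM_CTA_CGA
#endif
```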

Collaborator

I would appreciate more information on why 2 is recommended.
(The main reason being that it seems to differ from the default values picked by NCCL -- either 0 or 4 depending on SM arch:
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#c.cgaClusterSize)

Since this part touches DDP performance, I wonder if it would be easier for this PR to land if we moved that change into a separate PR?

Collaborator Author

These numbers come from internal studies. Specifically, we found that an NCCL CGA group size of 2 gives the best performance when communication is overlapped with GEMMs, whereas a CGA group size of 4 is best in the absence of overlap.

Moving this to a separate PR sounds good to me.

Collaborator

@kwen2501 kwen2501 Mar 29, 2023

I see, thanks for sharing the information and thanks for agreeing to move it into a separate PR!

I guess a follow-up question for that separate PR would be:
Since the process group is created outside, can we get the expected effect by modifying the CGA setting here?
Are we assuming that the process group hasn't run any collectives before being passed to DDP (and hence the NCCL communicator hasn't been created yet)?

Collaborator Author

Removed the DDP-related changes from this PR. I will introduce the default value of 2 for the CGA size and continue the discussion in a new PR.

@syed-ahmed
Collaborator Author

@pytorchbot label ciflow/inductor ciflow/trunk ciflow/periodic

@pytorch-bot pytorch-bot bot added the ciflow/inductor, ciflow/periodic, and ciflow/trunk labels Apr 19, 2023
@syed-ahmed
Collaborator Author

syed-ahmed commented May 16, 2023

@kwen2501 I believe I have addressed your review. The failing tests don't seem related.

@syed-ahmed syed-ahmed requested a review from kwen2501 May 18, 2023 00:01
@syed-ahmed
Collaborator Author

@kwen2501 can we get this approved?

Collaborator

@kwen2501 kwen2501 left a comment

Thanks much for the changes! LGTM!


@kwen2501
Collaborator

@pytorchbot merge -f "CI failures does not seem related"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

.def_readwrite("cga_cluster_size", &ncclConfig_t::cgaClusterSize)
.def_readwrite("min_ctas", &ncclConfig_t::minCTAs)
.def_readwrite("max_ctas", &ncclConfig_t::maxCTAs)
.def_readwrite("net_name", &ncclConfig_t::netName);
@ehuaa ehuaa Jun 28, 2023

There's a problem remaining in the pybind11 interface here. The type of netName is const char *, not std::string; if we simply assign net_name with nccl_options.config.net_name = "Socket", we get a UnicodeDecodeError because the Python string is stored into a const char * without any additional handling.
As with pybind/pybind11#2337 in pybind11, to assign net_name correctly here we should copy the string value into the const char * field via a new function; I'd like to open a pull request to fix this. @syed-ahmed @kwen2501
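
A rough sketch of the kind of fix I have in mind, replacing the direct def_readwrite binding of net_name with an explicit getter/setter that copies the Python string (illustrative only; the real patch would need to own the copied buffer rather than leak it, and the necessary includes such as <cstring> are omitted here):

```cpp
// Sketch: copy the Python string into separately owned storage instead of
// letting netName point at a temporary's buffer.
.def_property(
    "net_name",
    [](const ncclConfig_t& self) {
      return self.netName ? std::string(self.netName) : std::string();
    },
    [](ncclConfig_t& self, const std::string& name) {
      // strdup leaks in this sketch; real code should manage the lifetime.
      self.netName = strdup(name.c_str());
    });
```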
