Enables configuration of NCCL communicators #97394
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97394
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures as of commit e6fa35b. The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Dependent on this PR: #97407
kwen2501 left a comment:
In general looks good to me! Thanks!
I just have some minor comments. If they are clarified, I think we are good to go.
test/distributed/test_c10d_nccl.py (Outdated)
Tests involving file I/O can sometimes be flaky.
I wonder if merely testing if ProcessGroupNCCL is created successfully would suffice.
I did see usage of tempfile.NamedTemporaryFile() in test_c10d_nccl.py, and that's why I used it here. Is there a different way to do I/O that won't be flaky? I personally wasn't satisfied with testing only whether ProcessGroupNCCL is created. Creation of ProcessGroupNCCL doesn't necessarily mean it was created with the config values a user may specify in the context of this PR, and AFAIK the only way to figure out whether NCCL got those values is through the NCCL_DEBUG file. Maybe let's wait for the 2.17.1 update to be merged, then I'll rebase this PR and we can see whether this test is flaky.
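For reference, a rough sketch of the verification approach described above (a sketch only: the exact process-group construction and the NCCL log strings such as "Max CTAs" and "CGA cluster" are assumptions, not necessarily what the test in this PR does):

```python
import os
import re
import tempfile

import torch
import torch.distributed as dist


def check_nccl_comm_config(store, rank, world_size):
    # Send NCCL's INFO-level log to a temp file so the configured values can
    # be inspected after communicator creation.
    nccl_debug_file = tempfile.NamedTemporaryFile()
    os.environ["NCCL_DEBUG"] = "INFO"
    os.environ["NCCL_DEBUG_FILE"] = nccl_debug_file.name

    # Options plumbing from this PR; attribute names follow the pybind
    # bindings (config.min_ctas, config.max_ctas, config.cga_cluster_size).
    opts = dist.ProcessGroupNCCL.Options()
    opts.config.min_ctas = 2
    opts.config.max_ctas = 4
    opts.config.cga_cluster_size = 2

    pg = dist.ProcessGroupNCCL(store, rank, world_size, opts)
    # Run one collective so the NCCL communicator is actually created.
    pg.allreduce(torch.ones(1).cuda(rank)).wait()

    # NCCL 2.17+ reports the communicator config at INFO level; the exact
    # log wording checked here is an assumption.
    log = nccl_debug_file.read()
    assert re.search(rb"Max CTAs.*4", log) is not None
    assert re.search(rb"CGA cluster.*2", log) is not None
```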
Okay, sounds like a plan.
The latest pipelines test with NCCL 2.17.1. The test added in this PR is not flaky.
ncclConfig_t and ncclCommInitRankConfig were introduced in NCCL 2.14 and are already used by PR #95715 (which checks for 2.14):
https://github.com/pytorch/pytorch/pull/95715/files#diff-8ed049a500c254f133961d941563d701696a3ee8519a14c24f0ba1ef06f13a5eR163-R172
To more clearly distinguish from that PR, maybe we can name the define here more specifically towards CTA & CGA?
That sounds good to me.
Changed naming to NCCL_HAS_COMM_CTA_CGA
torch/nn/parallel/distributed.py (Outdated)
I would appreciate more information on why 2 is recommended.
(The main reason being that it seems to be different from the default values picked by NCCL -- either 0 or 4 depending on SM arch:
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#c.cgaClusterSize)
Since this part touches DDP performance, I wonder if it would be easier for this PR to land if we move it into a separate PR?
These numbers come from internal studies. Specifically, we found that an NCCL CGA cluster size of 2 gives the best performance when NCCL kernels are overlapped with GEMMs, whereas a CGA cluster size of 4 is best in the absence of overlap.
Moving this to a separate PR sounds good to me.
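For illustration, a minimal sketch of how a user could apply the heuristic above through the options exposed in this PR (the attribute names follow the pybind bindings quoted later; treat the exact plumbing as an assumption):

```python
import torch.distributed as dist


def make_nccl_options(overlaps_with_gemms: bool) -> dist.ProcessGroupNCCL.Options:
    # Heuristic from the internal study mentioned above: a CGA cluster size of 2
    # works best when NCCL kernels overlap with GEMMs, 4 when there is no overlap.
    opts = dist.ProcessGroupNCCL.Options()
    opts.config.cga_cluster_size = 2 if overlaps_with_gemms else 4
    return opts
```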
I see, thanks for sharing the information and thanks for agreeing to move it into a separate PR!
I guess a follow-up question for that separate PR would be:
Since the process group is created outside, can we get the expected effect by modifying the CGA setting here?
Are we assuming that the process group hasn't run any collective before being passed to DDP? (hence NCCL communicator hasn't been created)
Removed the DDP-related changes from this PR. I will introduce the default value of 2 for the CGA cluster size and continue the discussion in a new PR.
Force-pushed from 75bfe0d to 635cced.
@pytorchbot label ciflow/inductor ciflow/trunk ciflow/periodic
Force-pushed from e4a2a4f to e6fa35b.
@kwen2501 I believe I have addressed your review. The failing tests don't seem related.
@kwen2501 can we get this approved?
kwen2501 left a comment:
Thanks much for the changes! LGTM!
@pytorchbot merge -f "CI failures does not seem related"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
```cpp
.def_readwrite("cga_cluster_size", &ncclConfig_t::cgaClusterSize)
.def_readwrite("min_ctas", &ncclConfig_t::minCTAs)
.def_readwrite("max_ctas", &ncclConfig_t::maxCTAs)
.def_readwrite("net_name", &ncclConfig_t::netName);
```
There's a problem remaining in the pybind11 interface here. The type of netName is const char *, not std::string; if we simply assign net_name with nccl_options.config.net_name = "Socket", there will be a UnicodeDecodeError, because a Python string is stored into a const char * without any additional workaround.
As in pybind/pybind11#2337, to assign net_name correctly here we should copy the string value into the const char * with a new function; I would like to open a new pull request to fix it. @syed-ahmed @kwen2501
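For illustration, the failure mode being described looks roughly like this (a sketch, assuming the bindings quoted above):

```python
import torch.distributed as dist

opts = dist.ProcessGroupNCCL.Options()
# pybind11 converts the Python str into a temporary char buffer and stores the
# raw pointer into the const char* field; the buffer is freed right after the
# assignment, so later reads of net_name see dangling memory.
opts.config.net_name = "Socket"
print(opts.config.net_name)  # may return garbage or raise UnicodeDecodeError
```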
NCCL 2.17+ introduces user-configurable parameters for NCCL communicators via the ncclConfig_t datatype and ncclCommInitRankConfig. This PR enables that feature.
A user can tune the parameters as follows:
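Something along these lines (a sketch; the config attribute names follow this PR's bindings, while keyword names such as pg_options are assumptions based on the existing c10d API):

```python
import torch.distributed as dist

nccl_options = dist.ProcessGroupNCCL.Options()
nccl_options.config.cga_cluster_size = 2   # CGA cluster size used by NCCL kernels
nccl_options.config.min_ctas = 4           # lower bound on CTAs per NCCL kernel
nccl_options.config.max_ctas = 16          # upper bound on CTAs per NCCL kernel

# Pass the options when creating the default group or a sub-group.
dist.init_process_group("nccl", pg_options=nccl_options)
subgroup = dist.new_group(ranks=[0, 1], pg_options=nccl_options)
```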
The default values of these parameters are what is initialized by NCCL_CONFIG_INITIALIZER. Only for DistributedDataParallel, this PR sets the default value of cga_cluster_size to 2 (a heuristic that works well, especially for DDP workloads). Tuning these parameters can lead to improvements in end-to-end performance, since they affect the communication-computation overlap for NCCL kernels.
CC: @ptrblck @kwen2501