Commit 8708802
Enables configuration of NCCL communicators (#97394)
NCCL 2.17+ introduces some user configurable parameters for NCCL communicators using [ncclConfig_t](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#c.ncclConfig_t) datatype and [ncclCommInitRankConfig](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcomminitrankconfig). This PR enables that feature.
A user can tune the parameters as follows:
```
import torch.distributed as dist
nccl_options = dist.ProcessGroupNCCL.Options()
nccl_options.config.max_ctas = 32
nccl_options.config.min_ctas = 8
nccl_options.config.cga_cluster_size = 2
dist.init_process_group(backend='nccl', init_method='env://', pg_options=nccl_options)
my_group = dist.new_group(pg_options=nccl_options)
```
The default values of these parameters are what is initialized by `NCCL_CONFIG_INITIALIZER`. Only for DistributedDataParallel, this PR sets the default value of cga_cluster_size to 2 (a heuristic that works well especially for DDP workloads).
Tuning these parameters can lead to improvement in end-to-end performance, since it affects the communication-computation overlap for NCCL kernels.
CC: @ptrblck @kwen2501
Pull Request resolved: #97394
Approved by: https://github.com/kwen25011 parent 3cae6d2 commit 8708802
File tree
5 files changed
+106
-15
lines changed- test/distributed
- torch/csrc/distributed/c10d
5 files changed
+106
-15
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
4 | 4 | | |
5 | 5 | | |
6 | 6 | | |
| 7 | + | |
7 | 8 | | |
8 | 9 | | |
9 | 10 | | |
| |||
2713 | 2714 | | |
2714 | 2715 | | |
2715 | 2716 | | |
2716 | | - | |
2717 | | - | |
2718 | | - | |
2719 | | - | |
2720 | | - | |
2721 | | - | |
| 2717 | + | |
2722 | 2718 | | |
2723 | 2719 | | |
2724 | 2720 | | |
| |||
2737 | 2733 | | |
2738 | 2734 | | |
2739 | 2735 | | |
| 2736 | + | |
| 2737 | + | |
| 2738 | + | |
| 2739 | + | |
| 2740 | + | |
| 2741 | + | |
| 2742 | + | |
| 2743 | + | |
| 2744 | + | |
| 2745 | + | |
| 2746 | + | |
| 2747 | + | |
| 2748 | + | |
| 2749 | + | |
| 2750 | + | |
| 2751 | + | |
| 2752 | + | |
| 2753 | + | |
| 2754 | + | |
| 2755 | + | |
| 2756 | + | |
| 2757 | + | |
| 2758 | + | |
| 2759 | + | |
| 2760 | + | |
| 2761 | + | |
| 2762 | + | |
| 2763 | + | |
| 2764 | + | |
| 2765 | + | |
| 2766 | + | |
2740 | 2767 | | |
2741 | 2768 | | |
2742 | 2769 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
52 | 52 | | |
53 | 53 | | |
54 | 54 | | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
55 | 61 | | |
56 | 62 | | |
57 | 63 | | |
| |||
179 | 185 | | |
180 | 186 | | |
181 | 187 | | |
182 | | - | |
183 | 188 | | |
184 | 189 | | |
185 | | - | |
186 | | - | |
187 | | - | |
188 | | - | |
189 | | - | |
190 | | - | |
191 | | - | |
192 | | - | |
193 | 190 | | |
194 | 191 | | |
195 | 192 | | |
196 | 193 | | |
197 | 194 | | |
| 195 | + | |
| 196 | + | |
| 197 | + | |
| 198 | + | |
| 199 | + | |
| 200 | + | |
| 201 | + | |
| 202 | + | |
| 203 | + | |
| 204 | + | |
| 205 | + | |
| 206 | + | |
| 207 | + | |
| 208 | + | |
| 209 | + | |
| 210 | + | |
| 211 | + | |
| 212 | + | |
| 213 | + | |
| 214 | + | |
| 215 | + | |
198 | 216 | | |
199 | 217 | | |
200 | 218 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1156 | 1156 | | |
1157 | 1157 | | |
1158 | 1158 | | |
| 1159 | + | |
| 1160 | + | |
| 1161 | + | |
1159 | 1162 | | |
| 1163 | + | |
1160 | 1164 | | |
1161 | 1165 | | |
1162 | 1166 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
279 | 279 | | |
280 | 280 | | |
281 | 281 | | |
| 282 | + | |
| 283 | + | |
| 284 | + | |
| 285 | + | |
| 286 | + | |
282 | 287 | | |
283 | 288 | | |
284 | 289 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
2135 | 2135 | | |
2136 | 2136 | | |
2137 | 2137 | | |
| 2138 | + | |
| 2139 | + | |
| 2140 | + | |
| 2141 | + | |
| 2142 | + | |
| 2143 | + | |
| 2144 | + | |
| 2145 | + | |
| 2146 | + | |
| 2147 | + | |
| 2148 | + | |
| 2149 | + | |
| 2150 | + | |
| 2151 | + | |
| 2152 | + | |
| 2153 | + | |
| 2154 | + | |
2138 | 2155 | | |
2139 | 2156 | | |
2140 | 2157 | | |
| |||
2148 | 2165 | | |
2149 | 2166 | | |
2150 | 2167 | | |
| 2168 | + | |
| 2169 | + | |
| 2170 | + | |
| 2171 | + | |
| 2172 | + | |
| 2173 | + | |
| 2174 | + | |
| 2175 | + | |
2151 | 2176 | | |
2152 | 2177 | | |
2153 | 2178 | | |
2154 | 2179 | | |
| 2180 | + | |
| 2181 | + | |
| 2182 | + | |
| 2183 | + | |
2155 | 2184 | | |
2156 | 2185 | | |
2157 | 2186 | | |
2158 | 2187 | | |
| 2188 | + | |
| 2189 | + | |
| 2190 | + | |
| 2191 | + | |
| 2192 | + | |
| 2193 | + | |
2159 | 2194 | | |
2160 | 2195 | | |
2161 | 2196 | | |
2162 | 2197 | | |
2163 | 2198 | | |
| 2199 | + | |
| 2200 | + | |
2164 | 2201 | | |
2165 | 2202 | | |
2166 | 2203 | | |
| |||
0 commit comments