[NNC] Fix masking for all block and thread dimensions in CudaCodeGen #44733
Conversation
💊 CI failures summary and remediations (as of commit b6a686f; more details on the Dr. CI page):
- 1 failure not recognized by patterns
- ❄️ 3 failures tentatively classified as flaky, but reruns have not yet been triggered to confirm
Codecov Report

```
@@            Coverage Diff             @@
##           master   #44733      +/-   ##
==========================================
- Coverage   68.08%   68.08%   -0.01%
==========================================
  Files         384      384
  Lines       49774    49768       -6
==========================================
- Hits        33890    33883       -7
- Misses      15884    15885       +1
==========================================
```

Continue to review the full report at Codecov.
zheng-xq left a comment
A few minor comments. Thanks!
Minor: I am a bit troubled by the term "rank" here. Unless you see it somewhere in the CUDA guide, most places use "rank" to refer to the rank of a tensor. I would prefer to pick a different name here, but it is up to you.
Yeah, happy to - any suggestions?
I want to avoid "dimension", "extent" and "size" because they're overloaded as well. I landed on "reach"?
Minor: this is fine with me. But syncthreads can be part of the mask, as long as all threads go through the same branch. That is a corner case we can ignore for now, though.
Yeah, but then we'd need analysis to know that all threads went through the branch - this way it should just work.
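To make the trade-off above concrete, here is a small illustrative model (not PyTorch/NNC code; the function and its arguments are invented for this sketch) of why the codegen keeps `__syncthreads()` outside of thread-dimension masks: a barrier is only safe if every thread in the block reaches it, so a barrier inside a mask like `if (threadIdx.x < 5)` deadlocks unless the mask happens to cover all threads — which is exactly the case that would need extra analysis to prove.

```python
# Hypothetical model of CUDA barrier safety under masking.
# A __syncthreads() is safe only if all threads in the block reach it.

def block_reaches_barrier(num_threads, mask_extent, barrier_inside_mask):
    """Return True if every thread in the block reaches the barrier."""
    if not barrier_inside_mask:
        return True  # barrier outside the mask: every thread executes it
    # Barrier inside the mask: only threads with tid < mask_extent arrive.
    arrived = sum(1 for tid in range(num_threads) if tid < mask_extent)
    return arrived == num_threads

# Barrier outside the mask is always safe.
print(block_reaches_barrier(10, 5, barrier_inside_mask=False))   # True
# Inside a mask that excludes some threads: those threads never arrive.
print(block_reaches_barrier(10, 5, barrier_inside_mask=True))    # False
# The corner case above: the mask covers all threads, so it is still safe.
print(block_reaches_barrier(10, 10, barrier_inside_mask=True))   # True
```

Keeping barriers outside all masks avoids needing to prove the third case, at the cost of ruling it out.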
Minor: fine for now. But I could imagine an Alloc that turns into local registers and resides within the masks. We can always handle those cases when we get to that.
Yeah, I think the way to handle this is a pass that analyses usages of temporary buffers and removes allocates for buffers that are not needed. In that case registerizing accesses into scalars would remove accesses to the original buf and we could eliminate it entirely.
The case we definitely need to handle at this point is where the Buf should be shared across threads, and the allocate is turned into a variable definition which needs to stay in scope for all its usages.
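The pass described above could be sketched roughly as follows (a hypothetical illustration with invented names, not the NNC API): after registerization rewrites accesses into scalars, count the loads/stores that remain per buffer and drop the Allocate/Free pairs for buffers with no accesses left.

```python
# Hypothetical dead-allocation elimination pass over a flat statement list.
# Statements are modeled as (op, buf) tuples, e.g. ('alloc', 'A'), ('load', 'A').

def eliminate_dead_allocations(stmts):
    """Remove alloc/free pairs for buffers that are never loaded or stored."""
    uses = {}
    for op, buf in stmts:
        if op in ('load', 'store'):
            uses[buf] = uses.get(buf, 0) + 1
    return [(op, buf) for op, buf in stmts
            if op not in ('alloc', 'free') or uses.get(buf, 0) > 0]

prog = [('alloc', 'A'), ('alloc', 'B'),
        ('store', 'B'), ('load', 'B'),   # B is still accessed
        ('free', 'A'), ('free', 'B')]    # A was fully registerized away
print(eliminate_dead_allocations(prog))
# [('alloc', 'B'), ('store', 'B'), ('load', 'B'), ('free', 'B')]
```

The shared-buffer case from the comment above is the one this sketch ignores: there the allocate must stay, as a variable definition scoped over all its uses.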
test/cpp/tensorexpr/test_cuda.cpp
Outdated
In the commit message, could you list a few generated code examples before and after your changes? So it is easy to see the effect of your change.
Examples from the tests you mean?
I've updated the comment on this PR.
Force-pushed: b6d4976 → 2344b86 (Compare)
Force-pushed: 2344b86 → b6a686f (Compare)
facebook-github-bot left a comment
@nickgg has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…44733)

Summary: Unifies a number of partial solutions to the thread and block dimension extent masking, including the NoThreadIdxWriter and my last fix #44325. The NoThreadIdxWriter is gone in favour of tracking the current loop extents and masking any statements that have a lower rank than the launch parameters in any Block or Thread dimension, which handles both the "no" and "smaller" axis binding cases.

For example it will transform the following:

```
for i in 0..10    // blockIdx.x
  for j in 0..10  // threadIdx.x
    do thing(i, j);
  for k in 0..5   // threadIdx.x
    do other thing(i, k);
```

Into:

```
do thing(blockIdx.x, threadIdx.x);
if (threadIdx.x < 5) {
  do other thing(blockIdx.x, threadIdx.x);
}
```

And handle the case where statements are not bound by any axis, e.g.

```
do outer thing;
for i in 0..10    // blockIdx.x
  for j in 0..10  // threadIdx.x
    do thing(i, j);
  do other thing(i);
```

will become:

```
if (blockIdx.x < 1) {
  if (threadIdx.x < 1) {
    do outer thing;
  }
}
syncthreads();
do thing(blockIdx.x, threadIdx.x);
syncthreads();
if (threadIdx.x < 1) {
  do other thing(blockIdx.x);
}
```

Pull Request resolved: #44733
Reviewed By: mruberry
Differential Revision: D23736878
Pulled By: nickgg
fbshipit-source-id: 52d08626ae8043d53eb937843466874d479a6768
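The masking rule in the commit message above can be modeled with a short sketch (illustrative only; the names and representation are invented, not the actual CudaCodeGen): for each statement, compare the extent it is bound to in each block/thread dimension against the kernel launch extent (the maximum over all loops bound to that dimension), and emit a guard wherever the statement's extent is smaller. A statement bound to no axis behaves as if its extent were 1.

```python
# Hypothetical model of per-statement mask computation.

def mask_for(stmt_extents, launch_extents):
    """Both args map a dimension name to its extent.

    A dimension missing from stmt_extents means the statement is not
    bound to that axis, which is treated as extent 1.
    """
    guards = []
    for dim, launch in launch_extents.items():
        bound = stmt_extents.get(dim, 1)
        if bound < launch:
            guards.append(f"{dim} < {bound}")
    return guards

launch = {"blockIdx.x": 10, "threadIdx.x": 10}
# `do thing(i, j)` is bound to the full extents: no mask needed.
print(mask_for({"blockIdx.x": 10, "threadIdx.x": 10}, launch))  # []
# `do other thing(i, k)` has a smaller threadIdx.x extent: masked.
print(mask_for({"blockIdx.x": 10, "threadIdx.x": 5}, launch))
# ['threadIdx.x < 5']
# `do outer thing` is bound to no axis at all: masked in both dimensions.
print(mask_for({}, launch))
# ['blockIdx.x < 1', 'threadIdx.x < 1']
```

This reproduces both transformed examples above: the smaller-axis case gets `if (threadIdx.x < 5)`, and the unbound case gets nested `< 1` guards so only one thread executes it.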
Summary: A previous fix for masking Cuda dimensions (#44733) changed the behaviour of inserting thread synchronization barriers in the Cuda CodeGen, causing CudaSharedMemReduce_1 to be flaky and ultimately disabled.

The issue is working out where these barriers must be inserted - solving this optimally is very hard, and I think not possible without dependency analysis we don't have, so I've changed our logic to be quite pessimistic. We'll insert barriers before and after any blocks that have thread dimensions masked (even between blocks that have no data dependencies). This should be correct, but it's an area where we could improve performance. To address this somewhat I've added a simplifier pass that removes obviously unnecessary syncThreads.

To avoid this test being flaky again, I've added a check against the generated code to ensure there is a syncThread in the right place.

Also fixed a couple of non-functional clarity issues in the generated code: added the missing newline after Stores in the CudaPrinter, and prevented the PrioritizeLoad mutator from pulling out loads contained within simple Let statements (such as those produced by the Registerizer).

Pull Request resolved: #44909
Reviewed By: agolynski
Differential Revision: D23800565
Pulled By: nickgg
fbshipit-source-id: bddef1f40d8d461da965685f01d00b468d8a2c2f
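A minimal sketch of the simplifier pass described in the summary above (hypothetical code, not the NNC implementation, which operates on the IR rather than strings): with the pessimistic insertion strategy, barriers frequently end up back to back, and the second of two adjacent barriers with nothing between them is obviously unnecessary.

```python
# Hypothetical redundant-barrier elimination over a flat statement list.

def remove_redundant_syncs(stmts):
    """Drop a syncthreads immediately preceded by another syncthreads."""
    out = []
    for s in stmts:
        if s == "syncthreads" and out and out[-1] == "syncthreads":
            continue  # back-to-back barrier: the second one is a no-op
        out.append(s)
    return out

print(remove_redundant_syncs(
    ["store A", "syncthreads", "syncthreads", "load A"]))
# ['store A', 'syncthreads', 'load A']
```

A smarter pass could also remove barriers proven unnecessary by dependency analysis, but as the summary notes, that analysis is not available, so only the obvious cases are cleaned up.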