[NNC] Registerizer for GPU [1/x] #42606

nickgg · 2020-08-05T17:00:51Z

Adds a new optimization pass, the Registerizer, which looks for repeated Stores and Loads to a single element of a buffer and replaces them with a local temporary scalar, which is cheaper to access.

For example it can replace:

A[0] = 0;
for (int x = 0; x < 10; x++) {
  A[0] = (A[0]) + x;
}

with:

int A_ = 0;
for (int x = 0; x < 10; x++) {
  A_ = x + A_;
}
A[0] = A_;
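
As a quick sanity check on the rewrite above, the two forms can be written as plain C++ functions and compared. This is only an illustrative sketch: the function names and the one-element `std::array` are assumptions made for the example and are not part of the PR.

```cpp
#include <array>
#include <cassert>

// Original form: every iteration loads and stores A[0] in the buffer.
int sum_through_buffer(std::array<int, 1>& A) {
  A[0] = 0;
  for (int x = 0; x < 10; x++) {
    A[0] = A[0] + x;
  }
  return A[0];
}

// Registerized form: accumulate in a local scalar, store once at the end.
int sum_through_scalar(std::array<int, 1>& A) {
  int A_ = 0;
  for (int x = 0; x < 10; x++) {
    A_ = x + A_;
  }
  A[0] = A_;
  return A[0];
}

// Both forms should leave the same value in the buffer.
void check_equivalent() {
  std::array<int, 1> a{}, b{};
  assert(sum_through_buffer(a) == sum_through_scalar(b));
  assert(a[0] == b[0]);
}
```

The registerized version touches the buffer once instead of on every iteration, which is where the saving comes from.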

This is particularly useful on GPUs when parallelizing, since after replacing loops with metavars we have a lot of accesses like this. Early tests of simple reductions on a V100 indicate this can speed them up by ~5x.

This diff got a bit unwieldy with the integration code, so that will come in a follow-up.

dr-ci · 2020-08-05T17:14:01Z

💊 CI failures summary and remediations

As of commit fb6a08d (more details on the Dr. CI page):

4/4 failures possibly* introduced in this PR
- 2/4 non-CircleCI failure(s)

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

pytorch_linux_xenial_py3_6_gcc5_4_test (1/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Aug 11 05:05:30 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n                       ^\n" }

Aug 11 05:05:30 Traceback (most recent call last): 
Aug 11 05:05:30   File "test/run_test.py", line 716, in <module> 
Aug 11 05:05:30     main() 
Aug 11 05:05:30   File "test/run_test.py", line 705, in main 
Aug 11 05:05:30     raise RuntimeError(err) 
Aug 11 05:05:30 RuntimeError: test_quantization failed! 
Aug 11 05:05:30 + cleanup 
Aug 11 05:05:30 + retcode=1 
Aug 11 05:05:30 + set +x 
Aug 11 05:05:30 =================== sccache compilation log =================== 
Aug 11 05:05:30 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n                       ^\n" } 
Aug 11 05:05:30  
Aug 11 05:05:30 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Aug 11 05:05:30 Compile requests                 65 
Aug 11 05:05:30 Compile requests executed        35 
Aug 11 05:05:30 Cache hits                       27 
Aug 11 05:05:30 Cache misses                      7 
Aug 11 05:05:30 Cache timeouts                    0 
Aug 11 05:05:30 Cache read errors                 0 
Aug 11 05:05:30 Forced recaches                   0 
Aug 11 05:05:30 Cache write errors                0

pytorch_linux_xenial_py3_6_gcc5_4_ge_config_simple_test (2/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Aug 11 05:05:20 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n                       ^\n" }

Aug 11 05:05:20 Traceback (most recent call last): 
Aug 11 05:05:20   File "test/run_test.py", line 716, in <module> 
Aug 11 05:05:20     main() 
Aug 11 05:05:20   File "test/run_test.py", line 705, in main 
Aug 11 05:05:20     raise RuntimeError(err) 
Aug 11 05:05:20 RuntimeError: test_quantization failed! 
Aug 11 05:05:20 + cleanup 
Aug 11 05:05:20 + retcode=1 
Aug 11 05:05:20 + set +x 
Aug 11 05:05:20 =================== sccache compilation log =================== 
Aug 11 05:05:20 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n                       ^\n" } 
Aug 11 05:05:20  
Aug 11 05:05:20 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Aug 11 05:05:20 Compile requests                 65 
Aug 11 05:05:20 Compile requests executed        35 
Aug 11 05:05:20 Cache hits                       27 
Aug 11 05:05:20 Cache misses                      7 
Aug 11 05:05:20 Cache timeouts                    0 
Aug 11 05:05:20 Cache read errors                 0 
Aug 11 05:05:20 Forced recaches                   0 
Aug 11 05:05:20 Cache write errors                0

ci.pytorch.org: 2 failed

Failed: pr/caffe2-pytorch-linux-xenial-rocm3.5.1-py3.6-test
Failed: pr/pytorch-linux-xenial-rocm3.5.1-py3.6

This comment has been revised 21 times.

zheng-xq

Great change! A few minor changes.

zheng-xq · 2020-08-05T23:22:48Z

torch/csrc/jit/tensorexpr/registerizer.h

Minor: this struct has enough behavior to be made a class, with its members marked private.

I simplified this definition a bit, and would like to keep it as a record rather than a class.

zheng-xq · 2020-08-06T00:15:13Z

torch/csrc/jit/tensorexpr/registerizer.h

Minor: this might get quite expensive for a fairly large program. Maybe add a TODO to remind our-future-selves.

Which part gets expensive, comparing indices?

zheng-xq · 2020-08-06T00:19:23Z

torch/csrc/jit/tensorexpr/registerizer.h

Non-Blocking: I am perfectly fine with this approach in general. But I would like to point out that there are many more cases to handle before it is functionally correct in slightly more general settings.

From a certain perspective, the Registerizer moves a global memory access into thread-local memory. This could change the semantics if the memory access has a cross-thread dependency. For example, if a global read needs to observe another thread's atomic global write, that access must go through global memory.

This is not likely a problem because we cannot generate that complex a program yet. But we should keep reminding ourselves of the memory semantics change.

Right, this only addresses a subset of cases where you can push accesses to a scalar. I believe it's currently pessimistic, however: if any access anywhere in the program may overlap a registerization candidate, we won't registerize it. So we should always be correct, but we'll leave some perf on the table.

The next step is to divide the program into sub-sections which can have distinct registerizations and write them back at the boundaries of those subsections. I have some ideas on that, but it gets more complicated.
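
The pessimistic rule described in this reply can be sketched in a few lines. This is a simplified model under stated assumptions, not the PR's implementation: the `Access` record, `may_overlap`, and `can_registerize` are hypothetical names, and an index is modelled as either a known constant or "unknown".

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified access record: a buffer name plus either a
// known constant index or "unknown" (e.g. an index that depends on a loop var).
struct Access {
  std::string buffer;
  std::optional<int> index;  // nullopt: index not known statically
};

// Two accesses may overlap if they touch the same buffer and their
// indices are not provably different.
bool may_overlap(const Access& a, const Access& b) {
  if (a.buffer != b.buffer) return false;
  if (!a.index || !b.index) return true;  // unknown index: assume overlap
  return *a.index == *b.index;
}

// Pessimistic rule: the candidate qualifies only if every access in the
// program that may overlap it is the very same (buffer, constant index).
bool can_registerize(const Access& candidate,
                     const std::vector<Access>& program) {
  if (!candidate.index) return false;
  for (const Access& other : program) {
    if (!may_overlap(candidate, other)) continue;
    bool identical = other.index && *other.index == *candidate.index;
    if (!identical) return false;
  }
  return true;
}

void check_pessimistic_rule() {
  std::vector<Access> safe = {{"A", 0}, {"A", 0}, {"B", std::nullopt}};
  std::vector<Access> unsafe = {{"A", 0}, {"A", std::nullopt}};
  assert(can_registerize({"A", 0}, safe));
  assert(!can_registerize({"A", 0}, unsafe));
}
```

In this model a single access with an unknown index into the same buffer is enough to reject the candidate, which matches the "correct always, but leaves perf on the table" trade-off described above.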

zheng-xq · 2020-08-06T00:25:20Z

test/cpp/tensorexpr/test_registerizer.cpp

Change the comments to reflect the test.

Pardon, what do you mean here? How does this comment not reflect it?

I thought the code refers to A[x], not A[0]. No?

Yes thanks, good catch.

zheng-xq · 2020-08-06T00:26:26Z

test/cpp/tensorexpr/test_registerizer.cpp

Even in this case, please make sure you cannot replace the registers if "x" is marked with threadIdx/blockIdx or, in the future, "parallelize".

I am assuming that this pass occurs after block/thread axes are flattened down, but I'll add a check to make this explicit.

bertmaher

In compiler lingo I think this is often called "scalar replacement", which might be worth mentioning somewhere in the block comment describing this optimization pass :-)

facebook-github-bot

@bertmaher has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

nickgg · 2020-08-06T23:42:05Z

I ran into cases where this did the wrong thing; it needs the Let Stmt PR as well as a few other changes I'll add when this lands.

nickgg · 2020-08-06T23:42:52Z

@bertmaher thanks! I didn't know the name of this pattern but I figured it was common. Would ScalarReplacer be a better name than Registerizer?

bertmaher · 2020-08-07T03:09:11Z

Oh, idk, I kinda like the name "Registerizer" :). But ScalarReplacer would maybe be more typical. Up to you!

nickgg · 2020-08-07T18:46:30Z

Since I needed another diff to land for this, I ended up rolling the next set of improvements into this change. Some changes in the last push:

- Moved helpers into header files.
- Now inserts the definition and final store as close as possible to the first and last usage of the access, preventing bad ordering of vars (testRegisterizerAllocs covers this).
- Now supports registerizing accesses which are made up only of Loads and no Stores, and does not attempt to write the value of the scalar back to the buffer (the change to testRegisterizerNoLoads covers this).
- Now correctly handles cases where the buffer is not initialized in the kernel, and will initialize the scalar by reading the buffer (testRegisterizerNoInit and testRegisterizerLoadThenStore cover this).
- Now hoists the definition of the scalar to the highest loop axis that it does not depend on, meaning we now correctly cover cases where an access appears only inside an inner loop but does not depend on the loop var (testRegisterizerNoInit and testRegisterizerLoadThenStore cover this too).
- Now bails out early if there are still GPU block index or thread index loop options present in the tree (testRegisterizerParallelized covers this).
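
Two of the behaviours listed above (initializing the scalar from an uninitialized-in-kernel buffer, and hoisting the scalar above loops it does not depend on) can be illustrated with a plain C++ before/after pair. This is a hand-written sketch of the intended effect, not output of the pass; the function names and loop bounds are assumptions made for the example.

```cpp
#include <cassert>
#include <vector>

// Before: A[0] is never initialized in the kernel and is only touched
// inside the inner loop, but its index depends on neither x nor y.
void accumulate_in_buffer(std::vector<int>& A) {
  for (int x = 0; x < 4; x++) {
    for (int y = 0; y < 3; y++) {
      A[0] = A[0] + x * y;
    }
  }
}

// After (sketch): the scalar is defined above both loops, initialized by
// reading the buffer, and written back once after the last use.
void accumulate_in_scalar(std::vector<int>& A) {
  int A_ = A[0];
  for (int x = 0; x < 4; x++) {
    for (int y = 0; y < 3; y++) {
      A_ = A_ + x * y;
    }
  }
  A[0] = A_;
}

// Starting from the same buffer contents, both forms should agree.
void check_hoisted_form() {
  std::vector<int> a{7}, b{7};
  accumulate_in_buffer(a);
  accumulate_in_scalar(b);
  assert(a[0] == b[0]);
}
```

If the access were only ever loaded, the final write-back `A[0] = A_;` would be omitted, as the loads-only item above describes.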

zheng-xq · 2020-08-07T19:33:04Z

test/cpp/tensorexpr/test_registerizer.cpp

I thought the code refers to A[x], not A[0]. No?

zheng-xq · 2020-08-07T19:36:57Z

torch/csrc/jit/tensorexpr/analysis.h

Is the find() function used somewhere?

I use it in a follow up (currently). I'd like to keep it for now.

zheng-xq · 2020-08-07T19:37:16Z

torch/csrc/jit/tensorexpr/analysis.h

Does this member have to be public?

No, guess not.

zheng-xq · 2020-08-07T19:45:19Z

torch/csrc/jit/tensorexpr/registerizer.h

Minor: since we are at the stage of moving passes around and trying different orders, it would be good to list some of the ordering requirements with CudaCodeGen here. For example: this must be invoked after threadIdx flattening, but must happen before pass xyz. It doesn't have to be complete, just enough to remind us where not to move this pass.

zheng-xq · 2020-08-07T19:47:37Z

torch/csrc/jit/tensorexpr/stmt.h

This seems useful in general. Why not make it accept both Stmt*?

facebook-github-bot

@nickgg has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2020-08-11T20:15:03Z

@nickgg merged this pull request in aabdef5.

nickgg requested a review from apaszke as a code owner August 5, 2020 17:00

nickgg force-pushed the registerizer branch from ec6989f to b7fd93e Compare August 5, 2020 17:03

nickgg requested review from ZolotukhinM and zheng-xq August 5, 2020 17:05

facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Aug 5, 2020

nickgg requested a review from bertmaher August 5, 2020 21:38

zheng-xq approved these changes Aug 6, 2020

bertmaher reviewed Aug 6, 2020

facebook-github-bot reviewed Aug 6, 2020

nickgg force-pushed the registerizer branch from b7fd93e to 24fda22 Compare August 7, 2020 18:35

zheng-xq approved these changes Aug 7, 2020

nickgg force-pushed the registerizer branch 2 times, most recently from 768d9ed to 9a30c03 Compare August 10, 2020 21:55

facebook-github-bot reviewed Aug 10, 2020

[NNC] Registerizer for GPU [1/x]

fb6a08d

nickgg force-pushed the registerizer branch from 9a30c03 to fb6a08d Compare August 11, 2020 04:03

facebook-github-bot reviewed Aug 11, 2020

facebook-github-bot closed this in aabdef5 Aug 11, 2020

facebook-github-bot added the merged label Aug 11, 2020

mruberry added the Merged label Oct 28, 2020
