-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[NNC] Registerizer for GPU [1/x] #42606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
💊 CI failures summary and remediationsAs of commit fb6a08d (more details on the Dr. CI page):
🕵️ 2 new failures recognized by patternsThe following CI failures do not appear to be due to upstream breakages:
|
zheng-xq
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great change! A few minor changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: this class has enough behavior to be put into a class. And mark its members as private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I simplified this definition a bit, and would like to keep it as a record rather than a class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: this might get quite expensive for a fairly large program. Maybe add a TODO to remind our-future-selves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which part gets expensive, comparing indices?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-Blocking: I am perfectly fine with this approach in general. But I would like to point out there are a lot more cases to make it functionally correct in slightly more general cases.
From a certain perspective, the Registerizer move a global memory access to a thread-local memory. This could change the semantics if the memory access has cross-thread dependency. For example: if a global reads really needs to read the information of another atomic global writes, that access really needs to get through the global memory.
This is not likely a problem because we cannot generate that complex a program yet. But we should keep reminding ourselves of the memory semantics change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, this only addresses a subset of cases where you can push accesses to a scalar. I believe it's currently pessimistic, however, if there are any accesses which may overlap a registerization candidate program-wide we won't do it. So we should be correct always but we'll leave some perf on the table.
The next step is to divide the program into sub-sections which can have distinct registerizations and write them back at the boundaries of those subsections. I have some ideas on that, but it gets more complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change the comments to reflect the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pardon, what do you mean here? How does this comment not reflect it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the code refers to A[x], not A[0]. No?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes thanks, good catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even in this case, please make sure you cannot replace the registers, if "x" is marked with threadIdx/blockIdx, or in the future, "paralellize".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am assuming that this pass occurs after block/thread axes are flattened down, but I'll add a check to make this explicit.
bertmaher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In compiler lingo I think this is often called "scalar replacement", which might be worth mentioning somewhere in the block comment describing this optimization pass :-)
facebook-github-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bertmaher has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
|
I ran into cases where this did the wrong thing, it needs the Let Stmt PR as well as a few other changes I'll add when this lands. |
|
@bertmaher thanks! I didn't know the name of this pattern but I figured it was common. Would ScalarReplacer be a better name than Registerizer? |
Oh, idk, I kinda like the name "Registerizer" :). But ScalarReplacer would maybe be more typical. Up to you! |
|
Since I needed another diff to land for this, I ended up rolling the next set of improvements into this change. Some changes in the last push:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the code refers to A[x], not A[0]. No?
torch/csrc/jit/tensorexpr/analysis.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the find() function used somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use it in a follow up (currently). I'd like to keep it for now.
torch/csrc/jit/tensorexpr/analysis.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this member have to be public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, guess not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: since we are in the stage of moving around different passes and try different orders. It will be good to list some of the ordering requirement with CudaCodeGen here. For example: this must be invoked after threadIdx flattening, but must happen before pass xyz. It doesn't have to be complete, just enough to remind us where not to move this into.
torch/csrc/jit/tensorexpr/stmt.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems useful in general. Why not making it accept both Stmt*?
768d9ed to
9a30c03
Compare
facebook-github-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nickgg has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
facebook-github-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nickgg has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Adds a new optimization pass, the Registerizer, which looks for common Stores and Loads to a single item in a buffer and replaces them with a local temporary scalar which is cheaper to write.
For example it can replace:
with:
This is particularly useful on GPUs when parallelizing, since after replacing loops with metavars we have a lot of accesses like this. Early tests of simple reductions on a V100 indicates this can speed them up by ~5x.
This diff got a bit unwieldy with the integration code so that will come in a follow up.