port sparse_mm.reduce to pytorch and optimize it on CPU #83727
Conversation
Dr. CI: ❌ 18 new failures as of commit 75bfbc3, recognized by patterns; these CI failures do not appear to be due to upstream breakages.
Is it possible to make this implementation in torch.sparse?

Sure, definitely a better idea. Currently I am just trying to show the performance benefit on some GNN workloads (to convince the management that this type of work is worthwhile). There is still a lot of ongoing work here: making a better API definition for
Dr. CI: ✅ No failures as of commit 510535e. ❗ There is 1 currently active SEV. See artifacts and rendered test results at hud.pytorch.org/pr/83727.
### Motivation of this PR

This patch migrates `spmm_reduce` from `torch-sparse` (a third-party dependency for PyG) to `torch`, in response to the initial proposal for fusion of **Gather, Apply, Scatter** in Message Passing of GNN inference/training (#71300).

**GAS** is the major step in Message Passing; its behavior falls into two kinds depending on the storage type of `EdgeIndex`, which records the connections of nodes:

* COO: the hotspot is `scatter_reduce`
* CSR: the hotspot is `spmm_reduce`

The reduce type can be chosen from: "sum", "mean", "max", "min".

This PR extends `torch.sparse.mm` with a `reduce` argument, which maps to `torch.sparse_mm.reduce` internally. `sparse_mm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and the operator requires an internal interface `_sparse_mm_reduce_impl` which has dual outputs:

* `out` - the actual output
* `arg_out` - records the output indices of the non-zero elements when the reduce type is "max" or "min"; this is only useful for training, so it is not calculated during inference.

### Performance

Benchmarked on GCN for ogbn-products on a single Xeon socket, the workload is improved by `4.3x` with this patch. The performance benefit for training will be bigger: the original backward impl for `sum|mean` is sequential, and the original backward impl for `max|min` is not fused.

#### before:

```
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Name                           Self CPU %    Self CPU      CPU total %   CPU total     CPU time avg  # of Calls
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
torch_sparse::spmm_sum         97.09%        56.086s       97.09%        56.088s       6.232s        9
aten::linear                   0.00%         85.000us      1.38%         795.485ms    88.387ms      9
aten::matmul                   0.00%         57.000us      1.38%         795.260ms    88.362ms      9
aten::mm                       1.38%         795.201ms     1.38%         795.203ms    88.356ms      9
aten::relu                     0.00%         50.000us      0.76%         440.434ms    73.406ms      6
aten::clamp_min                0.76%         440.384ms     0.76%         440.384ms    73.397ms      6
aten::add_                     0.57%         327.801ms     0.57%         327.801ms    36.422ms      9
aten::log_softmax              0.00%         23.000us      0.10%         55.503ms     18.501ms      3
```

#### after:

```
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Name                           Self CPU %    Self CPU      CPU total %   CPU total     CPU time avg  # of Calls
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
aten::spmm_sum                 87.35%        11.826s       87.36%        11.827s      1.314s        9
aten::linear                   0.00%         92.000us      5.87%         794.451ms    88.272ms      9
aten::matmul                   0.00%         62.000us      5.87%         794.208ms    88.245ms      9
aten::mm                       5.87%         794.143ms     5.87%         794.146ms    88.238ms      9
aten::relu                     0.00%         53.000us      3.35%         452.977ms    75.496ms      6
aten::clamp_min                3.35%         452.924ms     3.35%         452.924ms    75.487ms      6
aten::add_                     2.58%         348.663ms     2.58%         348.663ms    38.740ms      9
aten::argmax                   0.42%         57.473ms      0.42%         57.475ms     14.369ms      4
aten::log_softmax              0.00%         22.000us      0.39%         52.605ms     17.535ms      3
```

cc jgong5 XiaobingSuper sanchitintel ashokei jingxu10 mlazos soumith voznesenskym yanboliang penguinwu anijain2305 EikanWang Guobing-Chen chunyuan-w zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire VitalyFedyunin
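For readers unfamiliar with the operator, the reduce semantics described above can be sketched as a minimal pure-Python reference (this is not the actual PyTorch kernel; the three CSR arrays follow the `crow_indices`/`col_indices`/`values` convention of `torch.sparse_csr_tensor`):

```python
def spmm_reduce(crow_indices, col_indices, values, dense, reduce="sum"):
    """Reference semantics of (sparse CSR matrix @ dense matrix) with a
    per-row reduction over the nonzero contributions.

    crow_indices: CSR row pointers, length n_rows + 1
    col_indices, values: column index and value of each nonzero
    dense: dense matrix as a list of rows (n_cols x k)
    reduce: one of "sum", "mean", "max", "min"
    """
    n_rows = len(crow_indices) - 1
    k = len(dense[0])
    out = [[0.0] * k for _ in range(n_rows)]
    for i in range(n_rows):
        start, end = crow_indices[i], crow_indices[i + 1]
        nnz_row = end - start
        if nnz_row == 0:
            continue  # rows with no nonzeros stay at the identity (0.0)
        for j in range(k):
            contribs = [values[e] * dense[col_indices[e]][j]
                        for e in range(start, end)]
            if reduce == "sum":
                out[i][j] = sum(contribs)
            elif reduce == "mean":
                out[i][j] = sum(contribs) / nnz_row
            elif reduce == "max":
                out[i][j] = max(contribs)
            elif reduce == "min":
                out[i][j] = min(contribs)
    return out
```

For the 2x2 matrix `[[1, 0], [2, 3]]` (CSR: `crow=[0,1,3]`, `col=[0,0,1]`, `vals=[1.,2.,3.]`) multiplied by an all-ones dense matrix, "sum" gives `[[1,1],[5,5]]` while "mean" divides row 1 by its two nonzeros, giving `[[1,1],[2.5,2.5]]`.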
pearu
left a comment
Thanks, @mingfeima for this PR!
I have a number of nits, a request to generate CSC test samples, and a suggestion to reduce the number of arguments in the new native functions to reduce the ambiguity of the naming conventions. For instance, `row_indices` means something different in the CSC context than in the CSR context. So, the suggestion is to capture the arguments `ccol_indices`, `row_indices`, `csr2csc` in a single argument `self_permute` that carries the same information and is easier to compute (my review suggestions are not tested and I may have missed a few details, but hopefully the general idea of the suggestions is clear).
```
- func: _sparse_mm_reduce_impl(Tensor self, Tensor other, str reduce, *, Tensor? row_indices=None, Tensor? ccol_indices=None, Tensor? csr2csc=None) -> (Tensor, Tensor)
  python_module: sparse
  dispatch:
    SparseCsrCPU: _sparse_mm_reduce_impl_sparse_csr_cpu
```
`SparseCsrCPU` is the dispatch key for sparse compressed tensors in general, so it includes the CSR, CSC, BSR, and BSC storage formats. To distinguish the formats, the layout must be checked within the `_sparse_mm_reduce_impl_sparse_csr_cpu` implementation.
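The practical consequence of that dispatch-key detail is a layout guard inside the kernel. A minimal Python sketch of such a guard (the function name, layout strings, and error messages are illustrative, not the actual PyTorch C++ code):

```python
def check_layout_supported(layout, supported=("sparse_csr",)):
    """Reject compressed layouts a CSR-only kernel cannot handle.

    SparseCsrCPU dispatches for every compressed layout (CSR, CSC,
    BSR, BSC), so an op implemented only for CSR must inspect the
    layout itself rather than rely on the dispatcher.
    """
    compressed = {"sparse_csr", "sparse_csc", "sparse_bsr", "sparse_bsc"}
    if layout not in compressed:
        raise TypeError(f"expected a sparse compressed layout, got {layout}")
    if layout not in supported:
        raise NotImplementedError(f"layout {layout} is not supported by this kernel")
    return True
```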
@pearu Thanks for your review! I have addressed your comments and updated this PR; could you please review again? Thank you very much! Updates:
pearu
left a comment
LGTM! Thanks, @mingfeima!
Some notes:
- The feature in this PR is implemented only for 2-D CSR tensors using CPU storage. Extending it to CSC tensors is possible but requires more work. For instance, the discussion in #83727 (comment) reveals that kernels need to be extended so that the invariant
  `torch.sparse.mm(x.to_sparse_csr(), y, "mean") == torch.sparse.mm(x.to_sparse_csc(), y, "mean")`
  holds. My initial estimate is that this requires the availability of both CSR and CSC indices to materialize (using the `csc.transpose(0, 1)` trick will not be sufficient). Notice that this note is not relevant for non-mean reductions that do not require values normalization.
- The argument triple `row_indices`, `ccol_indices`, `csr2csc` represents the tensor `self_permute = sparse_csr_tensor(csr.crow_indices(), csr.col_indices(), arange(csr._nnz())).to_sparse_csc()` and is likely relevant elsewhere too. So we might want to re-design this API to use a single indices tensor argument `self_permute` and apply it to other cases. But not in this PR!
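The `self_permute` trick pearu describes, building a CSR matrix whose values are `arange(nnz)` and converting it to CSC so that the converted values spell out the `csr2csc` permutation, can be sketched in plain Python without PyTorch. The function name `csr_to_csc_permutation` is illustrative, not an existing API:

```python
def csr_to_csc_permutation(crow_indices, col_indices, n_cols):
    """Compute CSC column pointers, CSC row indices, and the csr2csc
    permutation (for each CSC-ordered nonzero, its position in the
    CSR ordering), i.e. the information carried by the argument triple
    row_indices / ccol_indices / csr2csc."""
    nnz = len(col_indices)
    # Expand CSR row pointers to an explicit row index per nonzero.
    rows = []
    for r in range(len(crow_indices) - 1):
        rows.extend([r] * (crow_indices[r + 1] - crow_indices[r]))
    # Count nonzeros per column, then prefix-sum into CSC column pointers.
    ccol_indices = [0] * (n_cols + 1)
    for c in col_indices:
        ccol_indices[c + 1] += 1
    for c in range(n_cols):
        ccol_indices[c + 1] += ccol_indices[c]
    # Scatter each CSR-ordered entry into its column bucket; the entry's
    # CSR position e plays the role of the arange(nnz) "value".
    next_slot = list(ccol_indices[:-1])
    row_indices = [0] * nnz
    csr2csc = [0] * nnz
    for e in range(nnz):
        c = col_indices[e]
        slot = next_slot[c]
        row_indices[slot] = rows[e]
        csr2csc[slot] = e
        next_slot[c] += 1
    return ccol_indices, row_indices, csr2csc
```

For the matrix `[[1, 2], [3, 0]]` (CSR: `crow=[0,2,3]`, `col=[0,1,0]`), column 0 holds CSR entries 0 and 2 and column 1 holds entry 1, so the permutation comes out as `[0, 2, 1]`.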
@pytorchbot merge

Merge failed. Reason: approval needed from one of the following (Rule 'superuser').

@ezyang need super user for this one!

@pytorchbot merge

@pearu Thanks for the comments, really helpful! I will come back and perfect this PR later on.

Merge failed. Reason: approval needed from one of the following (Rule 'superuser').
ezyang
left a comment
ACKing reviewer list, no direct review
@pytorchbot merge

Merge failed. Reason: approval needed from one of the following (Rule 'superuser').

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
…nel using CSR format (#6699)

Related to an important optimization in PyTorch: ✅ [port sparse_mm.reduce to pytorch and optimize it on CPU #83727](pytorch/pytorch#83727)

This updates the simplified high-level API for the `spmm_reduce()` kernel and its tests. The current kernel implementation is limited to processing `src` of type `torch.Tensor` in `torch.sparse_csr` format, so I've added an option to auto-convert `src` to CSR format using `src.to_sparse_csr()`. The option is `False` by default, in which case a `ValueError` is raised if the input is not provided in the correct format. The conversion from `SparseTensor` to `torch.Tensor` is enabled by default for PyTorch > 1.13.

Added a transform to remove duplicates in the ogbn-products dataset, because the new kernel cannot handle duplicate entries (useful for benchmarks).

_Re-opened this PR because the draft (#6689) needed to be scrapped after a rebase._

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: rusty1s <matthias.fey@tu-dortmund.de>
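The two preprocessing steps this follow-up relies on, coalescing duplicate edges and converting COO input to CSR, can be illustrated with a small pure-Python sketch (`coo_to_csr` is a hypothetical helper for illustration, not the PyG API):

```python
def coo_to_csr(row, col, val, n_rows):
    """Coalesce duplicate COO entries (summing their values) and build
    CSR arrays. Mirrors the preprocessing the PR describes: the new
    kernel needs CSR input and cannot handle duplicate entries."""
    # Coalesce: accumulate values keyed by (row, col), then sort by
    # (row, col) to get CSR's row-major nonzero ordering.
    entries = {}
    for r, c, v in zip(row, col, val):
        entries[(r, c)] = entries.get((r, c), 0.0) + v
    coords = sorted(entries)
    col_indices = [c for _, c in coords]
    values = [entries[rc] for rc in coords]
    # CSR row pointers: per-row nonzero counts, prefix-summed.
    crow_indices = [0] * (n_rows + 1)
    for r, _ in coords:
        crow_indices[r + 1] += 1
    for r in range(n_rows):
        crow_indices[r + 1] += crow_indices[r]
    return crow_indices, col_indices, values
```

For example, the COO edges `(0,0)=1, (1,0)=2, (0,1)=3, (0,1)=4` contain a duplicate at `(0,1)`; coalescing sums it to `7.0` and the CSR output is `crow=[0,2,3]`, `col=[0,1,0]`, `values=[1.0,7.0,2.0]`.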