
Conversation

@ssnl ssnl (Collaborator) commented Jun 28, 2018

  1. Instead of using the non-`_out` variant, we allocate a buffer and use the `_out` variant to write the intermediate results into it.
  2. Reduce dimensions in decreasing order of size.

Benchmark:
Sum a `randn` tensor of shape `[200, 1, 30, 40, 20, 1, 50]` along dimensions `[4, 6, 3, 0, 2, 5]`, averaged over 1000 runs:

```
before patch:
CPU: 0.0441 s
CUDA: 0.0273 s

after patch:
CPU: 0.0234 s
CUDA: 0.0047 s
```
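The gain from reducing larger dimensions first can be sketched with a back-of-envelope count (pure Python, not the ATen code), under the assumption that each single-dimension reduction reads every element of the current intermediate tensor once; the shape and dims below are the benchmark's:

```python
# Model: each single-dim reduction touches every element of the current
# intermediate, so total work is the sum of intermediate sizes per step.
def elements_processed(shape, dims, largest_first=True):
    shape = list(shape)
    order = sorted(dims, key=lambda d: shape[d], reverse=largest_first)
    total = 0
    for d in order:
        size = 1
        for s in shape:
            size *= s
        total += size      # this step reads the whole intermediate
        shape[d] = 1       # the dimension collapses after reduction
    return total

shape = [200, 1, 30, 40, 20, 1, 50]
dims = [4, 6, 3, 0, 2, 5]
print(elements_processed(shape, dims, largest_first=True))   # 241224621
print(elements_processed(shape, dims, largest_first=False))  # 492410200
```

Largest-first roughly halves the elements touched here, because the initial 240M-element pass dominates and every later pass shrinks fastest when the biggest dimensions go first.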

@facebook-github-bot facebook-github-bot (Contributor) left a comment

@ssnl has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ssnl ssnl (Collaborator, Author) commented Jun 28, 2018

@pytorchbot retest this please


@ssnl ssnl mentioned this pull request Jun 29, 2018


@ezyang ezyang (Contributor) left a comment

I verified the algorithm for the non-out case and it looks correct.


@facebook-github-bot facebook-github-bot (Contributor) left a comment

@ssnl is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 30, 2018
Summary:
1. Instead of using the non-`_out` variant, we allocate a buffer and use the `_out` variant to write the intermediate results into it.
2. Reduce dimensions in decreasing order of size.

Benchmark:
Sum a `randn` tensor of shape `[200, 1, 30, 40, 20, 1, 50]` along dimensions `[4, 6, 3, 0, 2, 5]`, averaged over 1000 runs:
```
before patch:
CPU: 0.0441 s
CUDA: 0.0273 s

after patch:
CPU: 0.0234 s
CUDA: 0.0047 s
```
Closes pytorch/pytorch#8992

Differential Revision: D8681069

Pulled By: SsnL

fbshipit-source-id: 2c5d5af5c5a284f2e945181f2b24ee8c78becd50
@ssnl ssnl deleted the mulaxis branch June 30, 2018 03:16
```
return maybe_wrap_dim(dim, tensor_sizes[0].size());
}

// wrap each of dims based on dim_post_expr
```

```
// NB: this applies two optimizations:
// 1. Reducing the dimensions in order of decreasing size, so that the
//    larger dimensions are dealt with earlier and we work with fewer
//    elements overall.
// 2. Writing the intermediate results into a pre-allocated buffer using
//    the `_out` variant, rather than allocating a new tensor per step.
```
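Both optimizations together can be sketched in pure Python (an illustrative toy, not the ATen implementation; `reduce_dim` and `sum_dims` are hypothetical names): reductions run largest-dim-first, and two flat buffers are ping-ponged so nothing is allocated inside the loop, mirroring what the `_out` variant enables:

```python
def reduce_dim(src, shape, dim, out):
    """Sum `src` (flat, row-major for `shape`) over `dim`, keepdim-style,
    writing the result into the start of `out`; returns the new shape."""
    outer = inner = 1
    for s in shape[:dim]:
        outer *= s
    for s in shape[dim + 1:]:
        inner *= s
    n = shape[dim]
    for o in range(outer):
        for i in range(inner):
            acc = 0
            base = o * n * inner + i
            for k in range(n):
                acc += src[base + k * inner]
            out[o * inner + i] = acc
    return shape[:dim] + [1] + shape[dim + 1:]

def sum_dims(data, shape, dims):
    shape = list(shape)
    # Reduce larger dimensions first, shrinking the intermediate fastest.
    order = sorted(dims, key=lambda d: shape[d], reverse=True)
    src = list(data)
    # One scratch buffer, sized for the first (largest) intermediate,
    # ping-ponged with the input copy: no allocation per reduction step.
    dst = [0] * (len(data) // shape[order[0]])
    for d in order:
        shape = reduce_dim(src, shape, d, dst)
        src, dst = dst, src
    size = 1
    for s in shape:
        size *= s
    return src[:size], shape

vals, out_shape = sum_dims(list(range(24)), [2, 3, 4], [0, 2])
print(vals, out_shape)  # [60, 92, 124] [1, 3, 1]
```

The scratch buffer stays large enough for every later step because each reduction can only shrink the intermediate, which is the same reason a single pre-allocated buffer suffices in the patch.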

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 13, 2018
goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018