
Conversation

@ssnl ssnl (Collaborator) commented Jun 28, 2018

  1. Instead of using the non-`_out` variant, we allocate a buffer and use the `_out` variant to write the intermediate results into it.
  2. Reduce dimensions in decreasing order of size.

Benchmark:
Sum a `randn` tensor of shape `[200, 1, 30, 40, 20, 1, 50]` along dimensions `[4, 6, 3, 0, 2, 5]`, averaged over 1000 runs:

```
before patch:
CPU: 0.0441 s
CUDA: 0.0273 s

after patch:
CPU: 0.0234 s
CUDA: 0.0047 s
```
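The gain from reducing larger dimensions first can be sketched with a back-of-envelope count (pure Python, not the ATen code), under the assumption that each single-dimension reduction reads every element of the current intermediate tensor once; the shape and dims below are the benchmark's:

```python
# Model: each single-dim reduction touches every element of the current
# intermediate, so total work is the sum of intermediate sizes per step.
def elements_processed(shape, dims, largest_first=True):
    shape = list(shape)
    order = sorted(dims, key=lambda d: shape[d], reverse=largest_first)
    total = 0
    for d in order:
        size = 1
        for s in shape:
            size *= s
        total += size      # this step reads the whole intermediate
        shape[d] = 1       # the dimension collapses after reduction
    return total

shape = [200, 1, 30, 40, 20, 1, 50]
dims = [4, 6, 3, 0, 2, 5]
print(elements_processed(shape, dims, largest_first=True))   # 241224621
print(elements_processed(shape, dims, largest_first=False))  # 492410200
```

Largest-first roughly halves the elements touched here, because the initial 240M-element pass dominates and every later pass shrinks fastest when the biggest dimensions go first.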

@facebook-github-bot facebook-github-bot (Contributor) left a comment

@ssnl has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ssnl ssnl (Collaborator, Author) commented Jun 28, 2018

@pytorchbot retest this please


@ssnl ssnl mentioned this pull request Jun 29, 2018


@ezyang ezyang (Contributor) left a comment

I verified the algorithm for the non-out case and it looks correct.


@facebook-github-bot facebook-github-bot (Contributor) left a comment

@ssnl is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 30, 2018
Summary:
1. Instead of using the non-`_out` variant, we allocate a buffer and use the `_out` variant to write the intermediate results into it.
2. Reduce dimensions in decreasing order of size.

Benchmark:
Sum a `randn` tensor of shape `[200, 1, 30, 40, 20, 1, 50]` along dimensions `[4, 6, 3, 0, 2, 5]`, averaged over 1000 runs:
```
before patch:
CPU: 0.0441 s
CUDA: 0.0273 s

after patch:
CPU: 0.0234 s
CUDA: 0.0047 s
```
Closes pytorch/pytorch#8992

Differential Revision: D8681069

Pulled By: SsnL

fbshipit-source-id: 2c5d5af5c5a284f2e945181f2b24ee8c78becd50
@ssnl ssnl deleted the mulaxis branch June 30, 2018 03:16
```
return maybe_wrap_dim(dim, tensor_sizes[0].size());
}

// wrap each of dims based on dim_post_expr
```

```
// NB: this applies two optimizations:
// 1. Reducing the dimensions in order of decreasing size, so that the
//    larger dimensions are dealt with earlier and we work with fewer
//    elements overall.
// 2. Writing the intermediate results into a pre-allocated buffer using
//    the `_out` variant, rather than allocating a new tensor per step.
```
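Both optimizations together can be sketched in pure Python (an illustrative toy, not the ATen implementation; `reduce_dim` and `sum_dims` are hypothetical names): reductions run largest-dim-first, and two flat buffers are ping-ponged so nothing is allocated inside the loop, mirroring what the `_out` variant enables:

```python
def reduce_dim(src, shape, dim, out):
    """Sum `src` (flat, row-major for `shape`) over `dim`, keepdim-style,
    writing the result into the start of `out`; returns the new shape."""
    outer = inner = 1
    for s in shape[:dim]:
        outer *= s
    for s in shape[dim + 1:]:
        inner *= s
    n = shape[dim]
    for o in range(outer):
        for i in range(inner):
            acc = 0
            base = o * n * inner + i
            for k in range(n):
                acc += src[base + k * inner]
            out[o * inner + i] = acc
    return shape[:dim] + [1] + shape[dim + 1:]

def sum_dims(data, shape, dims):
    shape = list(shape)
    # Reduce larger dimensions first, shrinking the intermediate fastest.
    order = sorted(dims, key=lambda d: shape[d], reverse=True)
    src = list(data)
    # One scratch buffer, sized for the first (largest) intermediate,
    # ping-ponged with the input copy: no allocation per reduction step.
    dst = [0] * (len(data) // shape[order[0]])
    for d in order:
        shape = reduce_dim(src, shape, d, dst)
        src, dst = dst, src
    size = 1
    for s in shape:
        size *= s
    return src[:size], shape

vals, out_shape = sum_dims(list(range(24)), [2, 3, 4], [0, 2])
print(vals, out_shape)  # [60, 92, 124] [1, 3, 1]
```

The scratch buffer stays large enough for every later step because each reduction can only shrink the intermediate, which is the same reason a single pre-allocated buffer suffices in the patch.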

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 13, 2018
goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018